# ByteHouse

> Locale: en · URL: https://sensorswave.cn/en/docs/data-center/pipeline/destinations/bytehouse/

Configure a ByteHouse export pipeline to regularly sync event and user data to Volcengine ByteHouse.

The ByteHouse connector syncs event and user data from SensorsWave to your ByteHouse cloud data warehouse on a recurring schedule, enabling your data team to run deep analytics and join with other business datasets.

> **Supported edition**: Only **Volcengine ByteHouse Cloud** is supported. The Enterprise Edition and on-premises (private) deployments are not supported.

Typical use cases:

- Sync user behavior events to ByteHouse to join with orders and CRM data
- Export user profile data to build BI reports
- Retain historical event data from SensorsWave to meet compliance or long-term analytics requirements

The ByteHouse connector supports two export types:

- **Event export**: Incrementally syncs user behavior events to the `sensorswave_events` table (default, customizable at creation)
- **User export**: Incrementally syncs user profile attributes to the `sensorswave_users` table (default, customizable at creation)

Each export type requires a separate pipeline.

SensorsWave uses TOS (Volcengine Object Storage) as a staging area: incremental data is first uploaded to your TOS bucket, and ByteHouse then loads it into the target table via the `CnchS3` table function. In addition to ByteHouse itself, you also need a TOS bucket with the appropriate permissions.

## Prerequisites

Before creating a pipeline in SensorsWave, complete the following setup in your Volcengine account. We recommend creating a dedicated Database, Virtual Warehouse, Role, and User for SensorsWave to follow the principle of least privilege.
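If you script your environment, the database and role portion of this setup corresponds roughly to the DDL below. This is a sketch, not the supported path: the console flow described in Step 1 is the documented procedure, the `GRANT` syntax is assumed to match the ClickHouse-compatible form used in Step 2, and warehouse/user creation may be console-only on your plan.

```sql
-- Hypothetical DDL equivalent of the console-based database/role setup.
-- Sketch only; use the Volcengine Console flow in Step 1 as the
-- authoritative procedure.
CREATE DATABASE IF NOT EXISTS sensorswave;
CREATE ROLE IF NOT EXISTS sensorswave_role;
GRANT CREATE TABLE, DROP TABLE, INSERT, ALTER, SELECT
    ON sensorswave.* TO sensorswave_role;
```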
The examples below use the following object names; adjust as needed for your environment:

| Object | Example | Description |
|--------|---------|-------------|
| Database | `sensorswave` | Target database for synced data |
| Virtual Warehouse | `sensorswave_vw` | Compute resource the pipeline runs on |
| Role | `sensorswave_role` | Role that holds the pipeline permissions |
| User | `sensorswave_user` | Account used by the pipeline |
| TOS Bucket | `sensorswave` | Object storage bucket used as staging |

### Step 1: Configure ByteHouse Base Objects in the Volcengine Console

All of the following steps are performed in the **Volcengine → ByteHouse Console**; no SQL is required.

1. **Create the Database**: In "Database Management → New Database", create `sensorswave`.
2. **Create a Virtual Warehouse**: In "Virtual Warehouse Management → New Warehouse", create `sensorswave_vw`. Choose a size that matches your data volume.
3. **Create a Role**: In "Permission Management → Roles → Create Role", create `sensorswave_role`.
4. **Grant permissions to the role**: Grant `sensorswave_role` the following permissions on the `sensorswave` database:
   - `CREATE TABLE`
   - `DROP TABLE`
   - `INSERT`
   - `ALTER`
   - `SELECT` (optional)
5. **Create the User**: In "Permission Management → Users → Create User", create `sensorswave_user` with the following settings:
   - **Virtual Warehouse access**: Select `sensorswave_vw` from the previous step to grant the user permission to run SQL on that warehouse. (This grant is configured on the user, not via the role.)
   - **Authentication method** (choose one):
     - Password authentication: Set a strong password. Provide `Username` and `Password` in the pipeline configuration.
     - IAM authentication: Generate an **API Key** after creation. Provide the `API Key` in the pipeline configuration.
6. **Grant the role to the user**: After the user is created, go to "Permission Management → Roles", open `sensorswave_role`, and use the **Grant to User** action to add `sensorswave_user` as a member of the role.

### Step 2: Grant S3 and CREATE TEMPORARY TABLE Privileges

`S3` (a Source-class privilege) and `CREATE TEMPORARY TABLE` (session-level temporary tables) do not belong to any Database in ByteHouse's authorization model. The Volcengine Console does not expose these grants, so you must grant them via SQL. Sign in with an **admin-level account** and run the following statement either in the ByteHouse admin console's SQL editor or via [bytehouse-cli](https://www.volcengine.com/docs/6517/102213?lang=en):

```sql
GRANT S3, CREATE TEMPORARY TABLE ON *.* TO sensorswave_role;
```

### Step 3: Prepare a TOS Bucket and Access Key

SensorsWave uses TOS as a staging object storage tier that ByteHouse reads via `CnchS3`. In the Volcengine TOS console:

1. **Create the bucket**: It is recommended to create the TOS bucket in the same region as your ByteHouse instance (for example, `cn-beijing`) to avoid extra latency and cross-region traffic costs.
2. **Generate Access Keys**: Create an AK/SK pair under "Access Control → Access Keys" for SensorsWave to use. Grant that AK/SK the following minimum permissions (scoped to the bucket above):
   - `tos:PutObject`
   - `tos:GetObject`
   - `tos:DeleteObject`
   - `tos:ListBucket`

## Create a Pipeline

After completing the ByteHouse and TOS prerequisites, create a pipeline in SensorsWave.

1. Go to **Data Center → Pipeline** and click **New Pipeline**.
2. In the dialog, select **Export Pipeline** and click **Next**.
3. Select the **data source type**:
   - **Events**: Sync user behavior events
   - **Users**: Sync user profile attributes
4. Choose **ByteHouse** as the **connector**.
5. Fill in the ByteHouse and TOS connection settings (see "Configuration Reference" below).
6. Click **Test Connection** and wait for verification to pass.
7. Configure the **execution frequency** (see "Execution Frequency" below).
8. To sync only specific events, configure event filtering in **Advanced Options** (events export only).
9. Click **Save**. The pipeline starts immediately and automatically creates the target table in ByteHouse.

## Configuration Reference

### Connection Settings

**ByteHouse configuration**:

| Field | Required | Description | Example |
|-------|----------|-------------|---------|
| `Host` | Yes | Public ByteHouse endpoint | `tenant-2100000000-cn-beijing-public.bytehouse.volces.com` |
| `Port` | Yes | ByteHouse **TCP connection port** (not the HTTP port); defaults to `19000` | `19000` |
| `Database` | Yes | Target database name | `sensorswave` |
| `Virtual Warehouse` | No | Compute warehouse; leave blank to use the default warehouse bound to the user in the Volcengine console | `sensorswave_vw` |
| `Authentication Type` | Yes | `IAM (API Key)` or `Password` | `Password` |
| `Account ID` | Yes | Volcengine Account ID (required for both auth modes) | `2100000000` |
| `Username` | Password only | ByteHouse username | `sensorswave_user` |
| `Password` | Password only | ByteHouse password | — |
| `API Key` | IAM only | IAM user's API Key | — |
| `Table` | Yes | Target table name, customizable; defaults to `sensorswave_events` (events) or `sensorswave_users` (users); created automatically on save | `sensorswave_events` |

**TOS configuration**:

| Field | Required | Description | Example |
|-------|----------|-------------|---------|
| `Endpoint` | Yes | Public TOS endpoint (TOS native format, not the S3-compatible format) | `tos-cn-beijing.volces.com` |
| `Region` | Yes | TOS Region ID; see the [Volcengine region and endpoint reference](https://www.volcengine.com/docs/6349/107356?lang=en) for valid values | `cn-beijing` |
| `Bucket` | Yes | TOS bucket name | `sensorswave` |
| `Access Key` | Yes | TOS AK | `AKLT...` |
| `Secret Key` | Yes | TOS SK | — |

> **Note**: `Password`, `API Key`, and `Secret Key` are sensitive credentials. SensorsWave encrypts them at rest and does not display them again after saving. When editing an existing pipeline, these fields show "Configured; leave blank to keep unchanged"—leave them blank unless you want to change the value.

### Execution Frequency

**Interval Execution**: Triggered at fixed intervals (1–24 hours) starting from when the pipeline is created.

| Option | Recommended Use |
|--------|----------------|
| Every 1 hour (default) | Most analytics use cases |
| Every 2 hours | Reduce frequency for high-volume data |
| Every 6 hours | Regular incremental sync |
| Every 12 hours | Low-frequency updates such as user profiles |
| Every 24 hours | Daily batch export |

**Scheduled Execution**: Triggered at a fixed time in a specified timezone, ideal for off-peak execution windows. Select a timezone and time; the pipeline runs once daily at that moment. The timezone only affects trigger time calculation — see [Timezone Reference](#timezone-reference) for details.

## Data Model

### Events Table (default `sensorswave_events`)

Event data is written to this table in append-only mode, partitioned by day. The examples below use the default table name.

| Column | Type | Description |
|--------|------|-------------|
| `time` | `DateTime64(3)` | Event timestamp |
| `distinct_id` | `String` | Client-side user identifier |
| `trace_id` | `String` | Request trace ID |
| `ssid` | `Int64` | Server-side user identifier (SSID) |
| `anon_id` | `String` | Anonymous ID |
| `login_id` | `String` | Login ID |
| `event` | `String` | Event name, e.g., `$pageview`, `purchase` |
| `properties` | `JSONB` | Event properties JSON |
| `user_properties` | `JSONB` | Snapshot of user properties at event time |
| `received_at` | `DateTime64(3)` | Server receive time |

The timezone and format of time columns and datetime fields inside `properties` / `user_properties` are covered in [Timezone Reference](#timezone-reference).
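The target table is created automatically on save, so no DDL is required. For readers who want to reason about the storage layout, the sketch below approximates the table's shape; the `CnchMergeTree` engine, the `PARTITION BY` / `ORDER BY` clauses, and the `Asia/Shanghai` timezone argument are assumptions inferred from the schema and the day-partitioning behavior described above, and the actual DDL managed by SensorsWave may differ.

```sql
-- Approximate shape of the auto-created events table. Engine, partitioning,
-- and ORDER BY are assumptions; the timezone argument is 'UTC' on overseas
-- clusters (see Timezone Reference).
CREATE TABLE sensorswave.sensorswave_events
(
    time            DateTime64(3, 'Asia/Shanghai'),
    distinct_id     String,
    trace_id        String,
    ssid            Int64,
    anon_id         String,
    login_id        String,
    event           String,
    properties      JSONB,
    user_properties JSONB,
    received_at     DateTime64(3, 'Asia/Shanghai')
)
ENGINE = CnchMergeTree           -- assumed: ByteHouse Cloud's native engine
PARTITION BY toDate(time)        -- "partitioned by day"
ORDER BY (event, time);          -- assumed sort key
```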
Example: querying `JSONB` properties:

```sql
-- Query order_id and total_amount from event properties
SELECT
    time,
    event,
    jsonb_extract_string(properties, '$.order_id') AS order_id,
    jsonb_extract_float64(properties, '$.total_amount') AS total_amount
FROM sensorswave.sensorswave_events
WHERE event = 'purchase'
  AND time >= '2026-04-01'
LIMIT 100;
```

### Users Table (default `sensorswave_users`)

User data is written to this table using upsert by `ssid` (update if exists, insert if not).

| Column | Type | Description |
|--------|------|-------------|
| `ssid` | `Int64` | Server-side user identifier, primary key |
| `login_id` | `String` | Login ID |
| `anon_id` | `String` | Anonymous ID |
| `properties` | `JSONB` | User profile attributes JSON |
| `created_at` | `DateTime64(3)` | User creation time |
| `updated_at` | `DateTime64(3)` | Last updated time for user attributes |

Example: querying `JSONB` user properties:

```sql
-- Query membership level from user properties
SELECT
    ssid,
    login_id,
    jsonb_extract_string(properties, '$.membership_level') AS membership_level,
    updated_at
FROM sensorswave.sensorswave_users
WHERE jsonb_extract_string(properties, '$.membership_level') != ''
LIMIT 100;
```

## Timezone Reference

Several values in a ByteHouse pipeline carry timezone semantics. This section is the single source of truth for all of them.
### At a Glance

| Value | Timezone | Controlled By |
|-------|----------|---------------|
| Time columns (`time`, `received_at`, `created_at`, `updated_at`) | SaaS server timezone (`Asia/Shanghai` for CN, `UTC` for overseas) | SensorsWave cluster, not configurable |
| Datetime fields inside `properties` / `user_properties` | Same as time columns (SaaS server timezone) | Matches time columns |
| Backfill window splitting | SaaS server timezone | Fixed, not configurable |
| Scheduled execution trigger time | User-configured timezone | Set when creating the pipeline |

### Time Columns

All `DateTime64(3)` columns (`time`, `received_at`, `created_at`, `updated_at`) are bound to the **SaaS server timezone** with **millisecond precision** (matching the precision of datetime fields inside `properties` / `user_properties`). The full column type is `DateTime64(3, '{system_tz}')`, where `{system_tz}` is injected based on the cluster: `Asia/Shanghai` (UTC+8) for CN clusters, `UTC` for overseas clusters.

**Example (CN cluster)**: If the SaaS server timezone is `Asia/Shanghai` and an event occurred at `2026-04-13 15:30:00.123` Beijing time, ByteHouse displays `time` as `2026-04-13 15:30:00.123`. The column cannot be rendered in a different timezone inside ByteHouse; if you need cross-timezone analysis, convert query results at the application layer or in your BI tool.

### Datetime Fields Inside JSONB

Datetime-typed event or user properties are serialized into `properties` / `user_properties` as strings:

- **Format**: `YYYY-MM-DD HH:MM:SS.fff` (millisecond precision, no timezone offset)
- **Timezone**: Matches the time columns (SaaS server timezone)

**Example**: `{"purchase_at": "2026-04-13 15:30:00.000"}`

### Backfill Window Splitting Timezone

ByteHouse backfill splits the date range into calendar-day windows using the **SaaS server timezone** (`Asia/Shanghai` for CN, `UTC` for overseas) and processes them serially. Each window is a **left-closed, right-open** interval `[start, end)` — the `start` instant is included and the `end` instant is excluded, so adjacent windows never double-count events on the boundary.

**Example (CN cluster)**: Back-filling `2026-04-01 ~ 2026-04-03` produces three windows using `Asia/Shanghai` (all half-open):

- Window 1: `[2026-04-01 00:00:00, 2026-04-02 00:00:00)` (Asia/Shanghai)
- Window 2: `[2026-04-02 00:00:00, 2026-04-03 00:00:00)` (Asia/Shanghai)
- Window 3: `[2026-04-03 00:00:00, 2026-04-04 00:00:00)` (Asia/Shanghai)

### Scheduled Execution Trigger Timezone

Scheduled execution requires a timezone. For example, `Asia/Shanghai` 08:00 fires at 8 AM Beijing time daily. This timezone only affects trigger time calculation; it does not affect time column values or backfill window splitting.

## Event Filtering

Event filtering applies to **event exports only** and supports two modes:

- **Include events (allowlist)**: Export only the events listed; all other events are skipped
- **Exclude events (blocklist)**: Export all events except those listed

Configure the filter mode and event names (e.g., `$pageview`, `purchase`) in **Advanced Options**.

> **Note**: Filtering changes take effect in the next sync window after the current window completes; already-exported data is not reprocessed.

## Test Connection

After clicking **Test Connection**, the system runs the following 7 end-to-end verification steps against both ByteHouse and TOS:

1. **TCP connect**: Verify `Host` / `Port` are reachable.
2. **Authenticate**: Connect to ByteHouse using the configured auth method.
3. **Virtual Warehouse**: Switch to the specified warehouse (if `Virtual Warehouse` is set).
4. **Database**: Switch to the specified `Database`.
5. **Table check**: Create and drop a temporary probe table in the target database to verify table-creation permissions.
6. **TOS write**: Upload a test file to the TOS bucket to verify TOS AK/SK permissions.
7. **CnchS3 read**: Have ByteHouse read the test file from TOS, validating the `S3` grant and TOS connectivity end to end.

Each step's result is shown in real time to help pinpoint issues. The entire test completes within 3 minutes.

### Common Connection Errors

| Error | Possible Cause | Resolution |
|-------|---------------|------------|
| TCP connect failed | Wrong `Host` / `Port`, or network unreachable | Verify that public access is enabled on the ByteHouse instance, `Host` matches the public endpoint shown in the console, and `Port` is the TCP port (default `19000`) |
| Authentication failed | Wrong `Account ID`, `Username` / `Password`, or `API Key` | Re-check the Volcengine account ID and user credentials; for IAM, confirm the `API Key` has not expired |
| Virtual Warehouse access denied | The user does not have usage permission on the warehouse | In "Permission Management → Users", edit `sensorswave_user` and select `sensorswave_vw` |
| No CREATE TABLE permission | The role lacks `CREATE TABLE` | In the console, add `CREATE TABLE` / `DROP TABLE` to `sensorswave_role` |
| CnchS3 read failed | Role lacks the `S3` privilege, or TOS AK/SK lacks read permissions | Run `GRANT S3, CREATE TEMPORARY TABLE ON *.* TO sensorswave_role` in the ByteHouse admin console or via bytehouse-cli; check the IAM policy on your TOS AK/SK |
| TOS write failed | TOS AK/SK lacks write permission, or wrong `Bucket` / `Region` | Verify bucket name and region; grant write permission on the TOS AK/SK |

## Pipeline Management and Monitoring

### Pipeline Status

A pipeline's lifecycle states are:

```
New → Running → Stopped (irreversible)
```

> **Note**: Stopping a pipeline is irreversible. All scheduling stops and no new sync tasks are created. To resume exporting, you must create a new pipeline. Historical run records remain accessible after stopping.
### Viewing Run History

In the pipeline detail page's **Runs** tab, view execution details for each sync task:

- **Status**: Running / Success / Failed
- **Data range**: The time window covered by this sync
- **Rows exported**: Number of rows successfully written to ByteHouse
- **Execution time**: Task start and end timestamps
- **Logs**: Click a row to expand and view detailed execution logs

### Viewing Metrics

The **Metrics** tab shows overall pipeline performance (defaults to the last 30 days):

- Success / failure run count trend charts
- Daily exported rows trend chart
- Supports custom date range filtering

## Historical Data Backfill

### What is Backfill

After a pipeline is created, incremental sync only processes new data from the pipeline creation time forward. To load data for a period before the pipeline was created, use the backfill feature.

### Steps

1. In the pipeline detail page, switch to the **Backfill** tab.
2. Click **Start Backfill** and select the date range to backfill (start date and end date).
3. Confirm the date range and click **Confirm**.

> **Note**: Backfill splits calendar-day windows in the **SaaS server timezone**. See [Timezone Reference](#timezone-reference) for details.

### Backfill and Incremental Sync Interaction

- While a backfill is running, the pipeline's scheduled incremental sync **automatically pauses**—no data is lost; sync resumes from where it left off after the backfill completes or is cancelled
- Only one backfill can run at a time per pipeline
- If a backfill is interrupted (e.g., service restart), it automatically resumes from the last completed window—completed windows are not re-processed

### Cancel Backfill

Click **Cancel** while a backfill is in progress to stop it. The cancellation takes effect after the current window finishes; the in-progress window is not interrupted.
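After a backfill completes, a quick sanity check is to count rows per backfilled day directly in ByteHouse. A minimal example, assuming the default table name and a CN cluster (`Asia/Shanghai` day boundaries); the `>=` / `<` pair mirrors the left-closed, right-open `[start, end)` windows described in the Timezone Reference:

```sql
-- Rows per backfilled calendar day; the half-open predicate matches the
-- [start, end) window boundaries used by backfill.
SELECT toDate(time) AS day, count() AS rows
FROM sensorswave.sensorswave_events
WHERE time >= '2026-04-01 00:00:00'
  AND time <  '2026-04-04 00:00:00'
GROUP BY day
ORDER BY day;
```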
## Delivery Semantics and Data Consistency

| Data Type | Task Type | Write Method | Delivery Semantics |
|-----------|-----------|-------------|-------------------|
| Events | Scheduled incremental sync | Append | at-least-once |
| Events | Historical backfill | Idempotent overwrite per window | exactly-once (within a window) |
| Users | Scheduled incremental sync / Historical backfill | Upsert by `ssid` | exactly-once |

## Notes

- **Stopping a pipeline is irreversible**: Stopping requires creating a new pipeline to resume; historical run records are retained
- **Credentials are encrypted at rest**: `Password`, `API Key`, and `Secret Key` are encrypted and never displayed again; leaving them blank on edit preserves the current value
- **Target table is auto-created**: The target table is created automatically in ByteHouse when you save the pipeline—no manual table creation needed
- **Backfill timezone**: ByteHouse backfill splits calendar days in the SaaS server timezone; plan your backfill date range accordingly
- **Incremental sync pauses during backfill**: Scheduled sync is paused while backfill runs and resumes automatically after—no data is lost
- **Time columns are in local time**: ByteHouse time columns store the SaaS server timezone, not UTC; account for this when doing cross-timezone analysis
- **Array property semantics**: For `Array`-typed property fields inside `properties` / `user_properties`, a source value of `NULL` is always materialized as `[]` in ByteHouse; if your analytics need to distinguish `NULL` from `[]`, avoid relying on that difference at query time
- **Duplicate event rows**: `sensorswave_events` may contain a small number of duplicate rows.

  **Why this happens**: Scheduled incremental sync for events uses append writes with at-least-once delivery semantics. Duplicates come primarily from two scenarios.

  - **Scenario 1: Incremental task retries on failure**. If a scheduled incremental sync task hits a node failure, network error, or ByteHouse write timeout, it retries the current batch to avoid data loss. If the previous attempt actually succeeded in part or in full, the retry appends duplicate rows. This scenario is rare and limited in scope.
  - **Scenario 2: Backfill overlaps with incremental sync**. Historical backfill runs as a separate pipeline from scheduled incremental sync, and the two write independently. The `time` field records when the event actually occurred, while incremental sync advances by the event's server-side ingestion time. When both pipelines cover the same late-arriving event, duplicates appear. If an event occurs at `T` but is ingested at `T + N`:

    1. You trigger a backfill whose window is split by event occurrence time and covers `T`. The event is already in historical storage (ingested at `T + N`), so the backfill exports it based on its event time `T`.
    2. After the backfill completes, incremental sync resumes from its previous ingestion-time cursor. When it advances to `T + N`, it exports the same late-arriving event a second time.

  User data — whether written by scheduled incremental sync or backfill — is upserted by `ssid` with exactly-once semantics, so it is not affected.

  **Solutions**: The two approaches below can be used independently or together.
**Option 1: Create a deduplication view (recommended as the default query entry point)**

Create a view in ByteHouse that keeps the latest row per event identity:

```sql
CREATE VIEW sensorswave.sensorswave_events_dedup AS
SELECT
    time, distinct_id, trace_id, ssid, anon_id, login_id,
    event, properties, user_properties, received_at
FROM (
    SELECT
        *,
        row_number() OVER (
            PARTITION BY time, event, distinct_id, trace_id
            ORDER BY received_at DESC
        ) AS rn
    FROM sensorswave.sensorswave_events
)
WHERE rn = 1;
```

**Option 2: Only backfill data older than 7 days**

The server rejects events that arrive more than 7 days late, so time windows older than 7 days will not receive any further late-arriving events through incremental sync. Limiting the backfill end date to "current date − 7 days" eliminates duplicates at the source. For data within the last 7 days, rely on Option 1 and query through the deduplication view.

## FAQ

**Q: Test Connection fails at "CnchS3 read failed". What should I check?**

Work through these three items in order:

1. Did you run `GRANT S3, CREATE TEMPORARY TABLE ON *.* TO sensorswave_role` in the ByteHouse admin console or via bytehouse-cli?
2. Does the TOS AK/SK have read and list permissions?
3. Are `Endpoint` / `Region` / `Bucket` correct?

**Q: The pipeline is "Running". Can I modify the connection settings or event filters?**

Yes. Go to the pipeline detail page and click **Edit** to update the configuration. Changes take effect in the next sync window after the current one completes; frequency changes take effect immediately.

**Q: There are duplicate rows in the events table. How do I handle this?**

Scheduled incremental sync for events uses at-least-once delivery, so duplicates come from two scenarios: (1) an incremental sync task re-writes the same batch on retry after a failure (rare); (2) a backfill window covers a late-arriving event's occurrence time `T` and exports it by event time, and incremental sync then exports it again when it advances to ingestion time `T + N`. There are two ways to handle it:

- **Deduplicate at query time**: Use the deduplication view `sensorswave_events_dedup` from the "Notes" section above as the default query entry point. Alternatively, deduplicate using `row_number()` over the combination of `time`, `event`, `distinct_id`, and `trace_id`, keeping the record with the latest `received_at`. This covers both scenarios.
- **Avoid at backfill time**: Only backfill data older than "current date − 7 days". The server rejects events arriving more than 7 days late, so this range will not receive any new late-arriving events via incremental sync.

**Q: How do I query data inside `JSONB` columns (`properties` and `user_properties`)?**

Use ByteHouse's `jsonb_extract_*` functions with a JSONPath expression (`$.field_name`) to access nested fields by type:

```sql
-- String field
SELECT jsonb_extract_string(properties, '$.page_url') FROM sensorswave.sensorswave_events;

-- Integer field
SELECT jsonb_extract_int64(properties, '$.item_count') FROM sensorswave.sensorswave_events;

-- Float field
SELECT jsonb_extract_float64(properties, '$.total_amount') FROM sensorswave.sensorswave_events;

-- Boolean field
SELECT jsonb_extract_bool(properties, '$.is_vip') FROM sensorswave.sensorswave_events;
```

See the [ByteHouse JSONB type reference](https://www.volcengine.com/docs/6517/1580784?lang=en) for the complete function list.
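The deduplication view and the `jsonb_extract_*` functions compose naturally: querying through the view keeps retry- and backfill-induced duplicates out of aggregations. A sketch, assuming the `sensorswave_events_dedup` view from the "Notes" section has been created:

```sql
-- Aggregate a JSONB property through the dedup view so duplicate event
-- rows are excluded from the counts.
SELECT
    jsonb_extract_string(properties, '$.membership_level') AS level,
    count() AS purchases
FROM sensorswave.sensorswave_events_dedup
WHERE event = 'purchase'
GROUP BY level
ORDER BY purchases DESC;
```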
**Q: What timezone is the `time` column in ByteHouse?**

The `time` column uses `DateTime64(3, '{system_tz}')`, bound to the SaaS server timezone: CN clusters use `Asia/Shanghai` (UTC+8), overseas clusters use `UTC`. The column cannot be rendered in a different timezone inside ByteHouse; if you need cross-timezone analysis, convert query results at the application layer or in your BI tool.

**Q: Why must `S3` and `CREATE TEMPORARY TABLE` be granted `ON *.*`?**

In ByteHouse's authorization model, neither privilege belongs to any Database. `S3` is a Source-class privilege (like `URL`, `HDFS`, `MYSQL`), and `CREATE TEMPORARY TABLE` applies to session-level temporary tables (which have no Database). Grammatically, the only valid form is `ON *.*`; this does not grant cross-database read/write capability. Actual data access is still governed by the per-database grants (for example, the grants on the `sensorswave` database from Step 1).

---

**Last updated**: April 23, 2026