Title: BigQuery
Locale: en
URL: https://sensorswave.cn/en/docs/data-center/pipeline/destinations/bigquery/
Description: Configure a Google BigQuery export pipeline to regularly sync event and user data to your BigQuery data warehouse.

The BigQuery connector syncs event and user data from Sensors Wave to your Google BigQuery data warehouse on a recurring schedule, enabling teams in the Google Cloud ecosystem to run deep SQL analysis and join with other business datasets.

> **Note**: The BigQuery connector is currently **only available on the overseas cluster**. If you are on the CN cluster, please use the Snowflake or ByteHouse connector instead.

Typical use cases:

- Sync user behavior events to BigQuery to join with Google Ads, GA4, and CRM data
- Export user profile data and build models or dashboards with Looker, dbt, etc.
- Retain historical event data from Sensors Wave for compliance or long-term analytics requirements

The BigQuery connector supports two export types:

- **Event export**: Incrementally syncs user behavior events to the `sensorswave_events` table (default, customizable at creation)
- **User export**: Incrementally syncs user profile attributes to the `sensorswave_users` table (default, customizable at creation)

Each export type requires a separate pipeline.

## Prerequisites

Before creating a pipeline in Sensors Wave, complete the following setup in your Google Cloud account. We recommend creating a dedicated Service Account for Sensors Wave to follow the principle of least privilege.

The following steps use these example names; adjust them to match your organization's naming conventions:

| Object | Example | Description |
|--------|---------|-------------|
| Dataset | `sensorswave` | The BigQuery dataset that holds the target tables |
| Service Account | `sensorswave-user` | The service account used by the pipeline |

### Step 1: Prepare the GCP Project and Dataset

1. In [Google Cloud Console](https://console.cloud.google.com/), select or create a GCP project.
2. Open **BigQuery → BigQuery Studio**, click the kebab menu (⋮) next to your project, and choose **Create dataset**.
3. Fill in:
   - **Dataset ID**: e.g. `sensorswave`
   - **Location**: Choose a region appropriate for your business (e.g. `US`, `asia-east1`). The location cannot be changed after the dataset is created.
4. Click **Create dataset**.

> **Note**: BigQuery pipelines **will not auto-create the dataset**. Make sure the dataset exists before creating the pipeline.

### Step 2: Create a Service Account and Grant IAM Roles

Sensors Wave authenticates to BigQuery using a Service Account JSON Key. Grant the following two predefined roles:

| Role | Scope | Purpose |
|------|-------|---------|
| `BigQuery Job User` (`roles/bigquery.jobUser`) | Project | Create Load Jobs and Query Jobs |
| `BigQuery Data Editor` (`roles/bigquery.dataEditor`) | Dataset | Create tables, read/write data, and drop temporary tables in the target dataset |

Steps:

1. In Cloud Console, open **IAM & Admin → Service Accounts** and click **Create service account**.
2. Enter a name (e.g. `sensorswave-user`) and click **Create and continue**.
3. In the "Grant this service account access to project" step, select the **BigQuery Job User** role and click **Done**.
4. Open **BigQuery → BigQuery Studio**, find the target dataset, click ⋮ → **Share**, and add the service account as a **BigQuery Data Editor**.
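
If you prefer to script this setup, the dataset creation in Step 1 and the dataset-level grant in step 4 above can also be expressed as BigQuery DDL/DCL. This is a minimal sketch assuming the example names above and the placeholder project `my-gcp-project`; the project-level `BigQuery Job User` role still has to be granted through IAM, which dataset-level DCL does not cover:

```sql
-- Step 1 equivalent: create the dataset (its location cannot be changed later)
CREATE SCHEMA IF NOT EXISTS `my-gcp-project.sensorswave`
  OPTIONS (location = 'US');

-- Step 2 (step 4) equivalent: share the dataset with the pipeline's service account
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-gcp-project.sensorswave`
TO "serviceAccount:sensorswave-user@my-gcp-project.iam.gserviceaccount.com";
```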
> **Tip**: If your security policy requires least-privilege access, create a custom role with only the following atomic permissions:
>
> - Project level: `bigquery.jobs.create`
> - Dataset level: `bigquery.datasets.get`, `bigquery.tables.create`, `bigquery.tables.get`, `bigquery.tables.getData`, `bigquery.tables.updateData`, `bigquery.tables.delete`

### Step 3: Download the Service Account JSON Key

1. Click the service account you just created in the list.
2. Switch to the **Keys** tab, click **Add key → Create new key**, choose **JSON**, and click **Create**.
3. Your browser downloads the JSON Key file. Store it securely for the next step.

> **Note**: The JSON Key contains a private key and is a sensitive credential. Sensors Wave encrypts it at rest and will not display it again after saving. Do not commit it to source control or share it with unauthorized parties.

### Network Access

The BigQuery API (`bigquery.googleapis.com`) is publicly accessible over the Google network by default, so **IP allowlisting is generally not required**.

If you have enabled [VPC Service Controls](https://cloud.google.com/vpc-service-controls) to restrict the BigQuery API, add the Sensors Wave overseas-cluster egress IP to your access policy:

- **Overseas cluster egress IP**: `43.133.171.172`

## Create a Pipeline

After completing the GCP setup, create a pipeline in Sensors Wave.

1. Go to **Data Center → Pipeline** and click **New Pipeline**.
2. In the dialog, select **Export pipeline** and click **Next**.
3. Select **BigQuery** as the destination.
4. Select the **data source type**:
   - **Events**: Sync user behavior events
   - **Users**: Sync user profile attributes
5. Fill in the BigQuery connection settings (see "Configuration Reference" below):
   - Paste the entire content of the Service Account JSON Key into the **Service Account JSON** field. Sensors Wave automatically parses `project_id`, `client_email`, `private_key`, and other fields.
   - Enter the **Dataset ID** and **Table ID**.
6. Click **Test Connection** and wait for all 6 verification steps to pass.
7. Configure the **execution frequency** (see "Execution Frequency" below).
8. To sync only specific events, configure event filtering in **Advanced Options** (events export only).
9. Click **Save**. The pipeline starts immediately and automatically creates the target table in BigQuery.

## Configuration Reference

### Connection Settings

| Field | Required | Description | Example |
|-------|----------|-------------|---------|
| `Service Account JSON` | Yes | Paste the full content of the downloaded Service Account JSON Key. Not displayed again after saving. | `{"type": "service_account", "project_id": "...", "private_key": "..."}` |
| `Dataset ID` | Yes | The BigQuery dataset ID where the target table lives | `sensorswave` |
| `Table ID` | Yes | Target table name (customizable). Defaults to `sensorswave_events` (events) or `sensorswave_users` (users). Created automatically on save. | `sensorswave_events` |

### Execution Frequency

**Interval-based**: Triggers every fixed interval (1–24 hours), starting from the pipeline creation time.

| Option | Recommended for |
|--------|-----------------|
| Every 1 hour (default) | Most analytics scenarios |
| Every 2 hours | Larger datasets that benefit from a lower sync frequency |
| Every 6 hours | Routine incremental sync |
| Every 12 hours | User data and other low-update-frequency scenarios |
| Every 24 hours | Daily batch export |

**Scheduled**: Triggers at a fixed time of day in the specified time zone—useful for off-peak windows.
After choosing a time zone and trigger time, the pipeline runs once per day at that time. The time zone only affects the trigger time. See [Time Zones](#time-zones) for details.

## Data Model

### Events table (default `sensorswave_events`)

Event data is exported to this table, which is partitioned by day on the `time` column (`PARTITION BY DATETIME_TRUNC(time, DAY)`) for efficient partition pruning.

| Column | Type | Description |
|--------|------|-------------|
| `time` | `DATETIME` | Event occurrence time (wall clock, no time zone) |
| `distinct_id` | `STRING` | Client-side user identifier |
| `trace_id` | `STRING` | Request trace ID |
| `ssid` | `INT64` | Server-side user identifier (SSID) |
| `anon_id` | `STRING` | Anonymous ID |
| `login_id` | `STRING` | Login ID |
| `event` | `STRING` | Event name, e.g. `$pageview`, `purchase` |
| `properties` | `JSON` | Event properties |
| `user_properties` | `JSON` | Snapshot of user properties at event time |
| `received_at` | `DATETIME` | Server-side ingestion time (wall clock) |

For details on the time zone of the time columns and any datetime fields inside `properties` / `user_properties`, see [Time Zones](#time-zones).

Example query against `JSON` columns:

```sql
-- Query order_id and total_amount from event properties
SELECT
  time,
  event,
  JSON_VALUE(properties, '$."order_id"') AS order_id,
  CAST(JSON_VALUE(properties, '$."total_amount"') AS FLOAT64) AS total_amount
FROM `my-gcp-project.sensorswave.sensorswave_events`
WHERE event = 'purchase'
  AND time >= DATETIME '2026-04-01 00:00:00'
LIMIT 100;
```

### Users table (default `sensorswave_users`)

User data is exported to this table, which is clustered by `ssid` (`CLUSTER BY ssid`) for fast point lookups by user. Writes use MERGE semantics, matched on `ssid`: update if the user exists, insert otherwise.

| Column | Type | Description |
|--------|------|-------------|
| `ssid` | `INT64` | Server-side user identifier |
| `login_id` | `STRING` | Login ID |
| `anon_id` | `STRING` | Anonymous ID |
| `properties` | `JSON` | All user profile properties |
| `created_at` | `DATETIME` | First-seen time (wall clock) |
| `updated_at` | `DATETIME` | Last-updated time (wall clock) |

Example query against `JSON` user properties:

```sql
-- Query membership level from user properties
SELECT
  ssid,
  login_id,
  JSON_VALUE(properties, '$."membership_level"') AS membership_level,
  updated_at
FROM `my-gcp-project.sensorswave.sensorswave_users`
WHERE JSON_VALUE(properties, '$."membership_level"') IS NOT NULL
LIMIT 100;
```
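
For reference, the MERGE semantics described above behave roughly like the statement below. This is an illustrative sketch, not the connector's actual implementation; the `USING` clause stands in for an incoming batch with a single made-up user:

```sql
-- Illustrative only: how a MERGE on ssid upserts an incoming user row
MERGE `my-gcp-project.sensorswave.sensorswave_users` AS t
USING (
  SELECT
    1001 AS ssid,                                      -- made-up example user
    'u_1001' AS login_id,
    CAST(NULL AS STRING) AS anon_id,
    JSON '{"membership_level": "gold"}' AS properties,
    DATETIME '2026-04-01 00:00:00' AS created_at,
    DATETIME '2026-04-13 07:30:00' AS updated_at
) AS s
ON t.ssid = s.ssid
WHEN MATCHED THEN
  UPDATE SET
    login_id   = s.login_id,
    anon_id    = s.anon_id,
    properties = s.properties,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (ssid, login_id, anon_id, properties, created_at, updated_at)
  VALUES (s.ssid, s.login_id, s.anon_id, s.properties, s.created_at, s.updated_at);
```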
## Time Zones

BigQuery pipelines involve several distinct time zones. The summary below covers every value influenced by a time zone and where it comes from.

### Quick Reference

| Value | Time zone | Determined by |
|-------|-----------|---------------|
| Time columns (`time`, `received_at`, `created_at`, `updated_at`) | UTC (overseas-cluster wall clock) | Sensors Wave server (not configurable) |
| Datetime fields inside `properties` / `user_properties` | UTC (matches the time columns) | Same as time columns |
| Backfill date-range time zone | **Project timezone** | Configured by admins in project settings |
| Scheduled trigger time | User-configured time zone | Selected at pipeline creation |

### Time Columns

All time columns in BigQuery (`time`, `received_at`, `created_at`, `updated_at`) use the `DATETIME` type (wall-clock time without time-zone metadata) and store **UTC wall-clock time** with millisecond precision (matching the precision of datetime fields inside `properties` / `user_properties`). The BigQuery connector is only available on the overseas cluster, whose wall-clock time zone is `UTC`.

**Example**: An event that occurs at UTC `2026-04-13 07:30:00.123` is stored in the `time` column as `2026-04-13 07:30:00.123`.

To convert to another time zone in queries, first cast to `TIMESTAMP` with `TIMESTAMP(col, 'UTC')`, then render in the target zone with `DATETIME(timestamp, time_zone)`:

```sql
SELECT DATETIME(TIMESTAMP(time, 'UTC'), 'America/New_York') AS time_local
FROM `my-gcp-project.sensorswave.sensorswave_events`;
```

### Datetime Fields Inside JSON Properties

If an event or user property is itself a datetime value, it is serialized into `properties` / `user_properties` as a string:

- **Format**: `YYYY-MM-DD HH:MM:SS.fff` (millisecond precision, no time-zone offset)
- **Time zone**: Same as the time columns—UTC wall clock

**Example**: `{"purchase_at": "2026-04-13 07:30:00.000"}`
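
Because these values are plain strings inside the JSON, you have to parse them explicitly to compare or sort them as datetimes. A sketch, assuming a hypothetical `purchase_at` property in the format above (`%E3S` matches seconds with exactly three fractional digits):

```sql
-- Parse a millisecond-precision datetime property into a native DATETIME
SELECT
  PARSE_DATETIME('%Y-%m-%d %H:%M:%E3S',
                 JSON_VALUE(properties, '$."purchase_at"')) AS purchase_at
FROM `my-gcp-project.sensorswave.sensorswave_events`
WHERE JSON_VALUE(properties, '$."purchase_at"') IS NOT NULL
LIMIT 100;
```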
### Backfill Date-Range Timezone

BigQuery backfill uses the **project timezone** to interpret the backfill date range. The project timezone is configured by admins in project settings; if unset, the system default `UTC` is used.

The backfill date range is interpreted as a **half-open** time interval `[start, end)` — the `start` instant is included and the `end` instant is excluded, so two consecutive backfills chain at the boundary without double-covering events.

**Example 1: Project timezone matches the time-column timezone (project = `UTC`)**

A backfill of `2026-04-01 ~ 2026-04-03` is interpreted as:

- `start = 2026-04-01 00:00:00` (UTC, midnight of the start date)
- `end = 2026-04-04 00:00:00` (UTC, midnight of the day after the end date)

Because BigQuery's `time` column is also stored in UTC wall clock, this interval covers exactly three calendar days of events.

**Example 2: Project timezone differs from the time-column timezone (project = `America/New_York`, time columns = UTC)**

A backfill of `2026-04-01 ~ 2026-04-01` is interpreted as:

- Expressed in the **project timezone `America/New_York`**: `[2026-04-01 00:00:00 EDT, 2026-04-02 00:00:00 EDT)`
- Expressed in BigQuery's **UTC wall clock**: `[2026-04-01 04:00:00, 2026-04-02 04:00:00)`

So this backfill covers the UTC `time` range `2026-04-01 04:00 ~ 2026-04-02 04:00` — **not** all of UTC April 1.

> **Note**: The timezone used to interpret the backfill date range may differ from the time-column storage timezone. The backfill date range is interpreted in the project timezone, while time columns are always stored in UTC wall clock. If the two differ, the actual `time` range covered aligns with calendar days in the project timezone, not in UTC.
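
To sanity-check how a backfill range maps onto the UTC `time` column for your own project timezone, you can reproduce Example 2's boundary instants directly in BigQuery:

```sql
-- Example 2's backfill boundaries, converted from the project timezone to UTC wall clock
SELECT
  DATETIME(TIMESTAMP('2026-04-01 00:00:00', 'America/New_York')) AS utc_start,  -- 2026-04-01 04:00:00
  DATETIME(TIMESTAMP('2026-04-02 00:00:00', 'America/New_York')) AS utc_end;    -- 2026-04-02 04:00:00
```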
### Scheduled Trigger Time Zone

When you configure a scheduled run, you choose a time zone. For example, `America/New_York` 02:00 fires every day at 2 AM New York time. This time zone only affects the trigger time—it does not affect the time-column storage values or how the backfill date range is interpreted.

## Event Filtering

Event filtering applies only to **event exports** and supports two modes:

- **Include events (allowlist)**: Only export the events listed; skip all others
- **Exclude events (blocklist)**: Export all events except the ones listed

In **Advanced Options**, choose a filter mode and enter the event names (e.g., `$pageview`, `purchase`).

> **Note**: After modifying event filters, the change takes effect in the next sync window. Already-exported data is not reprocessed.

## Test Connection

When you click **Test Connection**, Sensors Wave runs end-to-end validation of your BigQuery configuration (authentication, dataset access, table creation, Load Job, and query permissions).

### Common Connection Errors

| Error | Likely cause | Resolution |
|-------|--------------|------------|
| Service Account JSON missing fields | The pasted JSON is incomplete or missing key fields | Confirm the JSON contains `project_id`, `private_key`, `client_email`, and `token_uri` |
| Authentication failed | `private_key` is invalid, or the service account is disabled / deleted | Check the service account status in Cloud Console; regenerate the JSON Key if needed |
| Dataset does not exist or is not accessible | Dataset name is misspelled, or the service account lacks `BigQuery Data Editor` | Verify the dataset name; confirm the service account has been shared on the dataset |
| Permission denied on Job creation | The service account lacks the project-level `BigQuery Job User` role | In IAM, grant the service account the project-level `BigQuery Job User` role |

## Pipeline Management and Monitoring

### Pipeline States

The lifecycle of a pipeline is:

```
Created → Running → Stopped (irreversible)
```

> **Note**: Stopping a pipeline is irreversible. After stopping, all schedules end and no new sync tasks are produced. To resume exporting, create a new pipeline. Run history is retained after stopping.

### View Run History

The **Runs** tab in the pipeline detail page shows execution details for each sync task:

- **Status**: Running / Success / Failed
- **Data range**: The time window covered by this run
- **Rows exported**: Rows written to BigQuery in this run
- **Execution time**: Start and end times
- **Logs**: Click an entry to view detailed execution logs

### View Metrics

The **Metrics** tab shows the overall pipeline state (last 30 days by default):

- Trend chart of successful / failed runs
- Trend chart of daily exported rows
- Custom time-range filter

## Historical Data Backfill

### What Is Backfill

After a pipeline is created, incremental sync only processes new data from the creation time onward. To import historical data from before the creation time, use the Backfill feature.

### Steps

1. Open the pipeline detail page and switch to the **Backfill** tab.
2. Click **Start backfill** and choose the date range (start and end).
3. Review the date range and click **Confirm**.

> **Note**: The backfill date range is interpreted as concrete start/end instants in the **project timezone**. See [Time Zones](#time-zones).

### Backfill and Incremental Sync

- While a backfill is running, scheduled incremental sync for that pipeline is **automatically paused**—no data is lost. After the backfill finishes or is canceled, incremental sync resumes from where it stopped.
- Only one backfill can run at a time per pipeline.
- If a backfill is interrupted (for example, by a service restart), it resumes from the last completed window. Already-completed windows are not redone.

### Cancel a Backfill

While a backfill is running, click **Cancel** to stop it. The cancellation takes effect after the current window completes—it does not interrupt the in-flight window.

## Delivery Semantics and Data Consistency

| Data type | Task type | Write mode | Delivery semantics |
|-----------|-----------|------------|--------------------|
| Events | Scheduled incremental sync | Append | at-least-once |
| Events | Historical backfill | Multi-statement transaction; atomic per window | exactly-once (per window) |
| Users | Scheduled / backfill | MERGE on `ssid` | exactly-once |

## Notes

- **Overseas cluster only**: This connector is not available on the CN cluster.
- **Pre-create the dataset**: The pipeline does not auto-create the dataset. Create it in advance in Cloud Console with the desired location.
- **Tables are auto-created**: Saving the pipeline creates the target table (with partitioning and clustering) automatically. You do not need to create it manually.
- **Service Account JSON security**: The `private_key` inside the JSON is encrypted at rest and not displayed again after saving. When editing a pipeline, leave the field blank to reuse the stored credential.
- **Stopping is irreversible**: After stopping, you must create a new pipeline to resume exporting. Run history is preserved.
- **Incremental sync pauses during backfill**: Scheduled incremental sync resumes automatically after backfill completes. No data is lost.
- **Time columns are UTC wall clock**: Convert when querying across time zones.
- **Duplicate rows in the events table**: A small number of duplicate rows may appear in `sensorswave_events`.

**Why duplicates can occur**: Scheduled incremental sync writes events in append-only mode with at-least-once delivery semantics. Duplicates come from two sources:

- **Source 1: Incremental retries on failure**. Scheduled incremental sync auto-retries on node failures, network errors, or BigQuery Load Job timeouts to prevent data loss. Load Jobs use a fixed Job ID for idempotency, but a small number of edge cases can still produce duplicate rows. This is rare and limited in scope.
- **Source 2: Backfill overlapping with incremental**. Historical backfills are atomic per window (rerunning the same window does not duplicate within that window), but backfill and incremental sync are two independent paths that write separately. The `time` column records the actual event time, while incremental sync advances by server-side ingestion time. When both paths cover the same late-arriving event, duplicates appear.

If an event occurs at `T` but isn't reported until `T + N`:

1. A backfill is launched and selects data by event time, covering `T`. The event (already in storage with ingestion time `T + N`) is exported to BigQuery by event time `T`.
2. After the backfill, incremental sync continues from its previous ingestion-time cursor; when it reaches `T + N`, it exports the same late-arriving event again.

User data is updated by `ssid` MERGE in both incremental and backfill paths, with exactly-once delivery semantics—no duplication.
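
Before choosing a remedy, you can measure the extent of duplication by grouping on the same identity columns the dedup view below uses (a sketch; adjust the table path to your project and dataset):

```sql
-- List event rows that appear more than once under the same identity columns
SELECT time, event, distinct_id, trace_id, COUNT(*) AS copies
FROM `my-gcp-project.sensorswave.sensorswave_events`
GROUP BY time, event, distinct_id, trace_id
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 100;
```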
**Solutions**: Use either approach below, or combine them.

**Option 1: Create a dedup view (recommended for everyday querying)**

Create a view in BigQuery that keeps the latest row per event identity:

```sql
CREATE VIEW `my-gcp-project.sensorswave.sensorswave_events_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY time, event, distinct_id, trace_id
      ORDER BY received_at DESC
    ) AS rn
  FROM `my-gcp-project.sensorswave.sensorswave_events`
)
WHERE rn = 1;
```

**Option 2: Only backfill data older than 7 days**

The server rejects events that are more than 7 days late, so windows older than 7 days will not see new late-arriving events from incremental sync. Limit the backfill end date to "today − 7 days" to avoid Source 2 duplicates entirely. For the most recent 7 days, use Option 1.

## FAQ

**Q: Test Connection fails with "dataset does not exist"—how do I fix it?**

Confirm the `project_id` in the Service Account JSON matches the project that owns the BigQuery dataset; verify the dataset name (case-sensitive); make sure the service account has been granted `BigQuery Data Editor` on the dataset.

**Q: Test Connection passes, but at runtime I see "permission denied on bigquery.jobs.create"—how do I fix it?**

The service account lacks the project-level `BigQuery Job User` role. In Cloud Console, open **IAM & Admin → IAM**, find the service account, and grant `BigQuery Job User`.

**Q: Can I edit connection settings or event filters while the pipeline is running?**

Yes. Open the pipeline detail page and click **Edit**. Changes take effect in the next sync window; schedule frequency changes take effect immediately. To rotate the Service Account JSON, paste the full contents of the new JSON file.

**Q: How do I deal with duplicate rows in the events table?**

The events table uses at-least-once delivery semantics. Duplicates come from two sources: incremental retries (rare) or backfill overlapping with incremental sync on late-arriving events. Two ways to handle them:

- **Query-side dedup**: Use the `sensorswave_events_dedup` view from the Notes section as your default query entry point, or apply `ROW_NUMBER()` deduplication on `time`, `event`, `distinct_id`, `trace_id`, keeping the row with the latest `received_at`.
- **Backfill-side avoidance**: Only backfill windows older than "today − 7 days".

**Q: How do I query `JSON` data inside `properties` and `user_properties`?**

BigQuery natively supports the `JSON` type. Use `JSON_VALUE` for scalar fields and `JSON_QUERY` for nested objects or arrays:

```sql
-- Extract a string field
SELECT JSON_VALUE(properties, '$."page_url"') AS page_url
FROM `my-gcp-project.sensorswave.sensorswave_events`;

-- Extract a numeric field (CAST required)
SELECT CAST(JSON_VALUE(properties, '$."item_count"') AS INT64) AS item_count
FROM `my-gcp-project.sensorswave.sensorswave_events`;

-- Extract a nested object
SELECT JSON_QUERY(properties, '$."address"') AS address
FROM `my-gcp-project.sensorswave.sensorswave_events`;
```

**Q: What time zone is the `time` column in BigQuery?**

The `time` column uses the `DATETIME` type and stores UTC wall-clock time (no time-zone metadata). To convert to another time zone:

```sql
SELECT DATETIME(TIMESTAMP(time, 'UTC'), 'America/New_York') AS time_local
FROM `my-gcp-project.sensorswave.sensorswave_events`;
```

**Q: Do I need to allowlist any IPs for BigQuery?**

Usually not. The BigQuery API is publicly accessible over the Google network. You only need to allowlist Sensors Wave's overseas-cluster egress IP `43.133.171.172` if you have enabled VPC Service Controls to restrict the API.

**Q: Is the BigQuery connector available on the CN cluster?**

Not currently. The BigQuery connector is only available on the overseas cluster.

---

**Last updated**: May 6, 2026