The BigQuery connector works by giving a Mixpanel-managed service account permission to read from BigQuery in your GCP project. You will need:

  CREATE SCHEMA `<gcp-project>`.`mixpanel`
    OPTIONS (
      description = 'Mixpanel connector staging dataset',
      location = '<same-as-the-tables-to-be-synced>',
    );

Grant the the Mixpanel service the following permissions:


Step 2: Load a warehouse table

Navigate to Project Settings → Warehouse Data (opens in a new tab) and click +Event Table.

Select a table or view representing an event from your warehouse and tell Mixpanel about the table. Once satisfied with the preview, click run and we’ll establish the sync. The initial load may take a few minutes depending on the size of the table, we show you progress as it’s happening.

🎉 Congrats, you’ve loaded your first warehouse table into Mixpanel! From this point onward, the table will be kept in sync with Mixpanel. You can now use this event throughout Mixpanel’s interface.

Table Types

Mixpanel’s Data Model consists of 4 types: Events, User Profiles, Group Profiles, and Lookup Tables. Each have properties, which are arbitrary JSON. Warehouse Connectors lets you turn any table or view in your warehouse into one of these 4 types of tables, provided they match the required schema.

Events

An event is something that happens at a point in time. It’s akin to a “fact” in dimensional modeling or a log in a database. Events have properties, which describe the event. Learn more about Events here.

Here’s an example table that illustrates what can be loaded as events in Mixpanel. The most important fields are the timestamp (when) and the user id (who) — everything else is optional.

TimestampUser IDItemBrandAmountType
2024-01-04 11:12:00alice@example.comshoesnike99.23in-store
2024-01-12 11:12:00bob@example.comsocksadidas4.56online

Here are more details about the schema we expect for events:

ColumnRequiredTypeDescription
Event NameYesStringThe name of the event. Eg: Purchase Completed or Support Ticket Filed. Note: you can specify this value statically, it doesn’t need to be a column in the table.
TimeYesTimestampThe time at which the event occurred.
User IDNoString or IntegerThe unique identifier of the user who performed the event. Eg: 12345 or grace@example.com.
Device IDNoString or IntegerAn identifier for anonymous users, useful for tracking pre-login data. Learn more here
JSON PropertiesNoJSON or ObjectA field that contains key-value properties in JSON format. If provided, Mixpanel will flatten this field out into properties.
All other columnsNoAnyThese can be anything. Mixpanel will auto-detect these columns and attach them to the event as properties.

User Profiles

A User Profile is a table that describes your users. It’s akin to a “dimension” in dimensional modeling or a relational table in a database. Learn more about User Profiles here.

Here’s an example table that illustrates what can be loaded as user profiles in Mixpanel. The only important column is the User ID, which is the primary key of the table.

User IDEmailNameSubscription Tier
12345grace@example.comGrace HopperPro
45678bob@example.comBob NoyceFree

Profile History is in beta. While Profiles typically only store the state of a user as of now, Profile History enables storing the state of a user over time. When creating a User Profile sync, set the Table Type to “History Table”. We expect tables to be modeled as a SCD (Slowly Changing Dimensions) Type 2 table. You will need to supply a Start Time column in the sync configuration. Mixpanel will infer a row's end time if a new row with a more recent start time for the same user is detected. Request beta access here (opens in a new tab).

Group Profiles

A Group Profile is a table that describes an entity (most often an Account, if you’re a B2B company). They are functionally identical to User Profiles, just used for other non-User entities. Group Profiles are only available if you have the Group Analytics add-on. Learn more about Group Analytics here.

Here’s an example table that illustrates what can be loaded as group profiles in Mixpanel. The only important column is the Group Key, which is the primary key of the table.

Group KeyNameDomainARRSubscription Tier
12345Notionnotion.so45000Enterprise
45678Linearlinear.so2000Pro

Lookup Tables

A Lookup Table is useful for enriching Mixpanel properties (e.g. content, skus, currencies) with additional metadata. Learn more about Lookup Tables here. Do note the limits of lookup tables indicated here.

Here’s an example table that illustrates what can be loaded as lookup table in Mixpanel. The only important column is the ID, which is the primary key of the table that is eventually mapped to a Mixpanel property

IDSong NameArtistGenre
12345One DanceDrakePop
45678VoyagerDaft PunkElectronic

Sync Modes

Warehouse Connectors regularly check warehouse tables for changes to load into Mixpanel. The Sync Mode determines which changes Mixpanel will reflect.

Mirror

Mirror syncs work by having the warehouse compute which rows have been inserted, modified, or deleted and sending this list of changes to Mixpanel. Change tracking is configured differently depending on the source warehouse. Mirror is supported for Snowflake and BigQuery sources.

Mirror takes BigQuery table snapshots (opens in a new tab) and runs queries to compute the change stream between two snapshot. Snapshots are stored in the mixpanel dataset created in Step 1.

Considerations when using Mirror with BigQuery:

How changes are detected:

Changed rows are detected by checksumming the values of all columns except trailing NULL-valued columns. For example in the following table would use these per-row checksums:

IDSong NameArtistGenreComputed checksum
12345One DanceDrakeNULLCHECKSUM(12345, 'One Dance', 'Drake')
45678VoyagerDaft PunkElectronicCHECKSUM(45678, 'Voyager', 'Daft Punk', 'Electronic')
83921NULLNULLClassicalCHECKSUM(83921, NULL, NULL, 'Classical')

Trailing NULL-values are excluded from the checksum to ensure that adding new columns does not change the checksum of existing rows. For example if a new column is added to the example table:

ALTER TABLE songs ADD COLUMN Tag STRING NULL;

It would not change the computed checksums:

IDSong NameArtistGenreTagComputed checksum
12345One DanceDrakeNULLNULLCHECKSUM(12345, 'One Dance', 'Drake')
45678VoyagerDaft PunkElectronicNULLCHECKSUM(45678, 'Voyager', 'Daft Punk', 'Electronic')
83921NULLNULLClassicalNULLCHECKSUM(83921, NULL, NULL, 'Classical')

Until values are written to the new column:

IDSong NameArtistGenreTagComputed checksum
12345One DanceDrakeNULLtag1CHECKSUM(12345, 'One Dance', 'Drake', NULL, 'tag1')
45678VoyagerDaft PunkElectronictag2CHECKSUM(45678, 'Voyager', 'Daft Punk', 'Electronic', 'tag2')
83921NULLNULLClassicalNULLCHECKSUM(83921, NULL, NULL, 'Classical')

Handling schema changes when using Mirror with BigQuery:

Adding new, default-NULL columns to Mirror-tracked tables/views is fully supported as described in the previous section.

ALTER TABLE <table> ADD COLUMN <column> STRING NULL;

We recommend avoiding other types of schema changes on large tables. Other schema changes may cause the checksum of every row to change, effectively re-sending the entire table to Mixpanel. For example, if we were to remove the Genre column in the example above the checksum of every row would be different:

IDSong NameArtistTagComputed checksum
12345One DanceDraketag1CHECKSUM(12345, 'One Dance', 'Drake', 'tag1')
45678VoyagerDaft Punktag2CHECKSUM(45678, 'Voyager', 'Daft Punk', 'tag2')
83921NULLNULLNULLCHECKSUM(83921)

Handling partitioned tables:

When syncing time partitioned (opens in a new tab) or ingestion-time partitioned (opens in a new tab) tables Mirror will use partition metadata to skip processing partitions that have not changed between sync runs. This will make the computation of the change stream much more efficient on large partitioned tables where only a small percentage of partitions are update between runs. For example, in a day-partitioned table with two years of data, where only the last five days of data are normally updated only five partitions worth of data will be scanned each time the sync runs.


Append

Append syncs require an Insert Time column in your table. Mixpanel remembers the maximum Insert Time it saw in the previous run of the sync and looks for only rows that have an Insert Time greater than that. This is useful and efficient for append-only tables (usually events) that have a column indicating when the data was appended.

Each time an Append sync runs, it will query the source table with a WHERE <insert_time_column> > <previous_max_insert_time> clause. This means that records added with an append time value before the <previous_max_insert_time> from the previous run can be missed (not imported) as they would be considered already ingested. The <insert_time_column> value should always reflect when the value was made available for Mixpanel to query and ingest.

Considerations when using Append with large BigQuery tables:

In an un-partitioned BigQuery table, the <insert_time_column> filtering results in a full scan of all data in the source table each time the sync runs. To minimize BigQuery costs we recommend partitioning the source table by the <insert_time_column> (opens in a new tab). Doing so will ensure that each incremental sync run only scans the most recent partitions (opens in a new tab).

To understand the potential savings consider a 100 GB source table with 100 days of data (approximately 1 GB of data per day):


Full

Full syncs periodically make a snapshot of the source table and sync it entirely to Mixpanel. If a row has new properties in your warehouse, the corresponding profile in Mixpanel will be overridden with those new properties. This mode is available for all tables except events.

Sync Frequency

Mixpanel offers a variety of sync frequency options to cater to different data integration needs. These options allow you to choose how often your data is synchronized from your data warehouse to Mixpanel, ensuring your data is up-to-date and accurate.

Standard Sync Frequency Options

Advanced Sync Frequency Option: Trigger via API

For more advanced synchronization needs, Mixpanel offers the ability to trigger syncs via API. This option generates a PUT URL that customers can use in their code to orchestrate Mixpanel sync jobs with other jobs, such as Fivetran pipelines or dbt jobs. By using this API trigger option, you can ensure 100% accuracy by aligning Mixpanel syncs with other critical data operations.

To use the API trigger option:

  1. Select Advanced>Trigger Via API under Sync Frequency in the table sync creation UI.
  2. After creating the sync, we will generate a PUT URL for you.
  3. Integrate this URL into your existing workflows or scripts.
  4. Authenticate the request with a Mixpanel Service Account. More information on setting up and using Mixpanel Service Accounts can be found here (opens in a new tab).
  5. Trigger the sync job programmatically, ensuring it runs in coordination with other data processes.

This flexibility allows you to maintain precise control over when and how your data is updated in Mixpanel, ensuring your analytics are always based on the latest information.

Note: If your table sync is set up with Mirror mode, you will need to run a sync job at least every 2 weeks to ensure our snapshots do not get deleted. We rate limit the number of syncs via API to 5 per hour.

FAQ

What tables are valuable to load into Mixpanel?

Anything that is event-based (has a user_id and timestamp) and that you want to analyze in Mixpanel. Examples, by data source are:

We also recommend loading your user and account tables, to enrich events with demographic attributes about the users and accounts who performed them.

How fast do syncs transfer data?

Syncs have a throughput of ~30K events+updates/second or ~100M events+updates/hour.

What is the best way to start bringing in event data?

We recommend starting with a subset of data in a materialized view to test the import process. This allows you to ensure that relevant columns are correctly formatted and the data appears as expected in Mixpanel. Once the data is imported, run a few reports to verify that you can accurately gain insight into your team’s KPIs with the way your data is formatted.

After validating your use case, navigate to the imported table and select "Delete Import" to hard delete the subset data. This step ensures that you can then import the entire table without worrying about duplicate data.

I already track data to Mixpanel via SDK or CDP, can I still use Warehouse Connectors?

Yes! You can send some events (eg: web and app data) directly via our SDKs and send other data (eg: user profiles from CRM or logs from your backend) from your warehouse and analyze them together in Mixpanel.

How do updates & deletes from Mirror syncs affect my event quota usage?

On an Events billing plan your event quota is consumed by:

You can see how much of your quota is being consumed by each warehouse connector in the detailed data usage view (opens in a new tab) for your organization.

On an MTU billing plan updates and deletes from Mirror do not impact MTU calculations, however updates and deletes are counted as events for the purposes of the MTU guardrail computation.

What will be the cost impact of this on my DWH?

The DWH cost of using a warehouse connector will vary based on the source warehouse and sync type used. Our connectors use warehouse-specific change tracking to compute modified rows in the warehouse and send only changed data to Mixpanel.

There are 3 aspects of DWH cost: network egress, storage, and compute.

How can I get help setting up a warehouse connector?

Reach out (opens in a new tab) to our team — we’re happy to walk you through the set up. If you bring a data engineer who has credentials to access your warehouse, it takes < 10 minutes to get up and running.

JavaIntegrations

Was this page useful?