Data Export

SeaUrchin.IO Data Export is available on request to Enterprise customers. Each day, we publish processed search log data for download in an easily consumable JSON format.

Requesting Access

Please contact SeaUrchin.IO support and specify the flow IDs (shown in the flow settings tab) whose exports you need. Export access is granted on a per-flow basis. We will provide an AWS Access Key ID and an AWS Secret Access Key, which are used to access the files. The same AWS keys work for every flow you request.

Setup

The export files are stored on Amazon S3, so downloads are performed with an S3 client program or library. The examples here use the AWS CLI (https://aws.amazon.com/cli/), but other clients may also work.


First, set up the AWS keys.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: (press Enter to keep the default)

You will also need the GnuPG encryption program, available for Mac, Windows and Linux.

To see if you have it installed, type:

$ gpg --version
gpg (GnuPG) 1.4.18
...

If you don't have it installed, you can download it for your platform from https://www.gnupg.org/download/

Downloading and Extracting

$ aws s3 cp s3://seaurchin-io-data-prod/export/<flow id>/2016-03-02.json.gz.gpg /path/to/dl/

Replace <flow id> with the ID of the flow you want, and replace the date with the day you want. Files are available from 2016-03-01 onward. The file for a given day is created between 11:00pm and 1:00am and contains data from the previous 24 hours. These times shift with daylight saving time so that each file covers a 24-hour period year-round.

You should now have the file /path/to/dl/2016-03-02.json.gz.gpg
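
To fetch several days in one go, the object path can be built from the flow ID and the date. The following is a minimal sketch that assumes the bucket and key layout shown in the command above; the function names export_uri and download_exports are illustrative and not part of any SeaUrchin.IO tooling.

```shell
# Hypothetical helpers; the bucket and key layout are copied from the
# example command above.

# Print the S3 URI of the export for a flow ID ($1) and a YYYY-MM-DD date ($2).
export_uri() {
    case $2 in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *)
            echo "export_uri: date must be YYYY-MM-DD, got '$2'" >&2
            return 1
            ;;
    esac
    printf 's3://seaurchin-io-data-prod/export/%s/%s.json.gz.gpg\n' "$1" "$2"
}

# Download the exports for one flow ($1) into a directory ($2) for every
# date given as a remaining argument. Stops at the first failure.
download_exports() {
    dl_flow=$1
    dl_dest=$2
    shift 2
    for dl_day in "$@"; do
        dl_uri=$(export_uri "$dl_flow" "$dl_day") || return 1
        aws s3 cp "$dl_uri" "$dl_dest/" || return 1
    done
}
```

For example, `download_exports 1234567890 /path/to/dl 2016-03-01 2016-03-02` would attempt to fetch two days for the flow used in the example record.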


The downloaded file must now be decrypted. To decrypt it, type:

$ gpg -d /path/to/dl/2016-03-02.json.gz.gpg > /path/to/dl/2016-03-02.json.gz
gpg: AES encrypted data
Enter passphrase: <decryption key from the flow settings tab>

You now have /path/to/dl/2016-03-02.json.gz, which is gzip-compressed. Use gunzip to decompress it. The result is a file with one JSON object per line, each representing one logical search performed by a user.
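
For unattended use, decryption and decompression can be combined into one pipeline. This is a sketch under the assumption that the decryption key has been saved to a file only you can read; the function names and the key-file argument are hypothetical. The gpg options --batch and --passphrase-file are standard; on GnuPG 2.1 and later, --pinentry-mode loopback must also be given for the passphrase file to be used.

```shell
# Strip the .gz.gpg suffix to get the name of the decrypted, decompressed file.
json_path_for() {
    printf '%s\n' "${1%.gz.gpg}"
}

# Decrypt and decompress one export file.
#   $1  path to the downloaded .json.gz.gpg file
#   $2  path to a file holding the decryption key (hypothetical location)
decrypt_export() {
    dec_out=$(json_path_for "$1")
    # GnuPG 2.1+: add "--pinentry-mode loopback" after --batch.
    # Write to a temporary name first so a failed run leaves no partial file.
    if gpg --batch --passphrase-file "$2" -d "$1" | gunzip > "$dec_out.tmp"; then
        mv "$dec_out.tmp" "$dec_out"
    else
        rm -f "$dec_out.tmp"
        return 1
    fi
}
```

Calling `decrypt_export /path/to/dl/2016-03-02.json.gz.gpg /path/to/keyfile` would leave /path/to/dl/2016-03-02.json next to the download.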


You can explore the file and format the JSON easily with the jq utility (https://stedolan.github.io/jq/).

Pretty print everything:

$ jq . 2016-03-02.json

Print queries, one per line:

$ jq -r .query 2016-03-02.json

See the jq documentation for more expressive examples, including filtering.
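
As one illustration of filtering, jq's select keeps only the records that match a condition. The sketch below lists the queries that returned no results and counts how often each occurred. The function name is hypothetical; the field names come from the Format section.

```shell
# Print zero-result queries with their counts, most frequent first.
#   $1  path to a decompressed export file (one JSON object per line)
zero_hit_queries() {
    # Records without a no_hits field do not match, because null == 1 is false.
    jq -r 'select(.no_hits == 1) | .query' "$1" | sort | uniq -c | sort -rn
}
```

For example, `zero_hit_queries 2016-03-02.json | head` would show the most common queries that found nothing that day.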

Format

The file is UTF-8 encoded and contains one JSON object per line.

Fields whose value is 0 or empty may be omitted, so treat a missing field as 0 or empty.

Field Name              Description
clicked_any             Clicked any result? (0 or 1)
clicks                  Number of clicks
clicks_1                # of clicks on position 1
clicks_2                # of clicks on position 2
clicks_3_4              # of clicks on position 3 and 4
clicks_5_up             # of clicks on position 5 and up
clicked_positions       Array of click positions
flow                    Search flow ID
item_page_redirects     Direct hit? (0 or 1)
latency_under_50_ms     Search returned in under 50ms
latency_under_100_ms    Search returned in under 100ms
latency_under_200_ms    Search returned in under 200ms
latency_under_500_ms    Search returned in under 500ms
latency_under_1000_ms   Search returned in under 1000ms
latency_under_2000_ms   Search returned in under 2000ms
latency_under_5000_ms   Search returned in under 5000ms
latency_under_10000_ms  Search returned in under 10000ms
latency_unknown         Latency unknown
mrr                     Mean reciprocal rank
no_hits                 Zero hits? (0 or 1)
num_hits                Number of results
query                   Query string
rbp_50                  Rank-biased precision, p = 0.5
rbp_80                  Rank-biased precision, p = 0.8
result_viewed_0_s       # of results viewed for at least 0s
result_viewed_1_s       # of results viewed for at least 1s
result_viewed_2_s       # of results viewed for at least 2s
result_viewed_5_s       # of results viewed for at least 5s
result_viewed_10_s      # of results viewed for at least 10s
result_viewed_30_s      # of results viewed for at least 30s
result_viewed_60_s      # of results viewed for at least 60s
result_viewed_120_s     # of results viewed for at least 120s
result_viewed_300_s     # of results viewed for at least 300s
result_viewed_600_s     # of results viewed for at least 600s
timestamp_sec           Timestamp of the first event in this search
visitor                 SeaUrchin.IO visitor ID
tag_<customTagName>     See the API docs for details on how to get additional custom fields to appear in the export.
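
Because fields with 0 or empty values may be absent from a record, aggregate calculations should supply a default instead of assuming the field exists. The sketch below uses jq's alternative operator (//), which substitutes the right-hand value when the left-hand side is null or false. The function name is hypothetical, and the inputs builtin requires jq 1.5 or later.

```shell
# Print the fraction of searches in which at least one result was clicked.
#   $1  path to a decompressed export file (one JSON object per line)
click_through_rate() {
    # -n with "inputs" streams the records one at a time instead of
    # loading the whole file into memory.
    jq -n 'reduce inputs as $r ({n: 0, c: 0};
               .n += 1 | .c += ($r.clicked_any // 0))
           | if .n == 0 then 0 else .c / .n end' "$1"
}
```

A record that lacks clicked_any counts as a search with no clicks, which matches the omission rule stated above.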

Example record

{
    "visitor": "ac4ff4e5-7b29-4f7f-812f-4d81bd3a0063",
    "rbp_80": 0.19999999999999996,
    "rbp_50": 0.5,
    "result_viewed_0_s": 1,
    "result_viewed_1_s": 1,
    "result_viewed_2_s": 1,
    "result_viewed_5_s": 1,
    "clicks_5_up": 0,
    "clicks_3_4": 0,
    "clicks_2": 0,
    "clicks_1": 1,
    "clicks": 1,
    "clicked_any": 1,
    "clicked_positions": [1],
    "query": "example query",
    "flow": 1234567890,
    "timestamp_sec": 1456950869,
    "mrr": 1,
    "latency_under_2000_ms": 1,
    "latency_under_5000_ms": 1,
    "latency_under_10000_ms": 1,
    "num_hits": 3,
    "tag_customTagName": "tag value"
}
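
The metric values in this record can be reproduced from its clicked_positions using the usual textbook definitions: the reciprocal rank of a search is 1 divided by its first clicked position, and rank-biased precision is (1 - p) times the sum of p^(k - 1) over the clicked positions k. Whether SeaUrchin.IO computes the fields in exactly this way is an assumption; the sketch only shows that these definitions agree with the example. The function names are hypothetical.

```shell
# Rank-biased precision for persistence p ($1) and the clicked positions
# given as the remaining arguments, printed to four decimal places.
rbp() {
    rbp_p=$1
    shift
    printf '%s\n' "$@" |
        awk -v p="$rbp_p" 'NF { sum += p ^ ($1 - 1) }
                           END { printf "%.4f\n", (1 - p) * sum }'
}

# Reciprocal rank: 1 / smallest clicked position, or 0 when there were no clicks.
reciprocal_rank() {
    printf '%s\n' "$@" |
        awk 'NF && (min == 0 || $1 < min) { min = $1 }
             END { printf "%.4f\n", (min ? 1 / min : 0) }'
}
```

With the single click at position 1 from the record, `rbp 0.8 1` and `rbp 0.5 1` give 0.2000 and 0.5000, in line with rbp_80 (0.19999999999999996 is 1 - 0.8 in floating point) and rbp_50, and `reciprocal_rank 1` gives 1.0000, in line with mrr.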

Security

We use two layers of security to protect your data. First, we control access to your files with the AWS IAM system: only your account's AWS key is allowed to download export files for your flows. We do not retain access to this key, but we can reset it if required. Second, we encrypt every file with 128-bit AES symmetric encryption, using the well-regarded open source GnuPG suite and a random passphrase (the decryption key). We recommend that you safeguard access to both the AWS key and the decryption key within your organization.

Can't find what you're looking for? Send an email to support@seaurchin.io