BigQuery (relational database)
Introduction
BigQuery is a database technology for storing and analyzing large and complex datasets, such as the longitudinal data in FinnGen. With SQL, users can query just the subset of data they need, which is much faster than reading the entire dataset and filtering it afterwards. Query results can be fed directly into downstream analyses.
Data
finngen-production-library
This project/database contains phenotype data for the different releases of FinnGen data. The data in this project/database comes from the LIBRARY_RED and LIBRARY_GREEN folders within the Sandbox IVM.
| Dataset | Table | Description | Sandbox folder | Data freeze |
| --- | --- | --- | --- | --- |
| sandbox_tools_r13 | finngen_r13_service_sector_detailed_longitudinal_v1 | Service sector detailed longitudinal data | LIBRARY_RED | DF13 |
| sandbox_tools_r13 | endpoint_cohorts_r13_v1 | Endpoint cohorts | LIBRARY_RED | DF13 |
| sandbox_tools_r13 | code_counts_r12_v1 | Code counts | LIBRARY_RED | DF12 |
| sandbox_tools_r13 | fg_codes_info_v8 | Code translation info table | LIBRARY_GREEN | |
| finngen_results_r13 | | Achilles tables | LIBRARY_RED | DF13 |
Superset
Superset is a web tool for exploring the FinnGen data stored in the BigQuery database. Users can run SQL queries against the data and view the results as visualizations. Superset is accessed from the IVM Applications menu (Applications > FinnGen > Superset).
In the Explore tab, users can build different visualizations of the data using the graphical interface. In the SQL Lab tab, direct SQL queries can be run against the data.
Take a look at how to export FINNGENIDs from Atlas using Superset.
How to save results
Results can be saved into a Sandbox-specific BQ database named “Sandbox” (see the available databases in the Superset Database tab). Users can import small custom tables into the “Sandbox” BQ database and export result tables to the Sandbox IVM as CSV files.
Superset demo
See a Superset tutorial video from the FinnGen Users' Meeting of 22nd September 2020 (at 50 min).
More information:
bq command-line tool
It is also possible to access the BigQuery tables with the bq command-line tool, a Python-based command-line interface for running queries.
Before running a query, users can check how much it would cost by doing a dry run with the bq command-line tool, for example (the table here is one of those listed in the Data section; any accessible table works):
bq query --use_legacy_sql=false --dry_run 'SELECT COUNT(*) FROM `finngen-production-library.sandbox_tools_r13.endpoint_cohorts_r13_v1`'
A dry run does not execute the query; instead it reports how much data the query would process. Queries are billed on demand at an estimated $6.25 per tebibyte (TiB) processed. More details on the pricing structure can be found here.
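The pricing arithmetic behind that estimate can be sketched in Python. The helper below is a hypothetical illustration (not part of any FinnGen tooling); feed it the byte count reported by the dry run:

```python
def estimate_query_cost_usd(bytes_processed: int, usd_per_tib: float = 6.25) -> float:
    """Estimate the on-demand cost of a query from the bytes it would scan.

    Hypothetical helper: 1 TiB = 2**40 bytes, billed at usd_per_tib.
    """
    TIB = 2 ** 40
    return bytes_processed / TIB * usd_per_tib

# A query scanning a full TiB costs the headline rate:
print(estimate_query_cost_usd(2 ** 40))                  # 6.25
# A 10 GiB scan is far cheaper:
print(round(estimate_query_cost_usd(10 * 2 ** 30), 4))   # 0.061
```

In other words, a typical query touching only a few gigabytes costs a few cents, which is why filtering to the needed subset matters.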
Google Cloud BigQuery python and R drivers
It is also possible to use the Google Cloud BigQuery Python and R client libraries to access the data directly from the IVM.
See a tutorial video on querying BigQuery with Python and R scripts in the Users' Meeting recordings (at 30 min 2 s).
Use cases
Links on how to connect to BigQuery from R and Python, along with some use cases for downstream analysis:
Example Python script:
```python
from google.cloud import bigquery


def examplequery():
    # The project running the queries is the user's own sandbox project;
    # by default the client picks it up from your SB environment
    # (see the Sandbox number in the Sandbox IVM desktop file buckets.txt).
    client = bigquery.Client()
    # Tables must be referenced in the format project.dataset.table;
    # note that cohort 5 must first be generated in Atlas.
    query_job = client.query(
        """
        SELECT person_source_value
        FROM `finngen-production-library.finngen_omop_result.cohort`
        LEFT JOIN `finngen-production-library.finngen_omop.person`
            ON cohort.subject_id = person.person_id
        WHERE cohort_definition_id = 5
        ORDER BY person_source_value
        LIMIT 20
        """
    )
    results = query_job.result()
    print("20 first finngen ids from atlas cohort 5")
    for row in results:
        print(row.person_source_value)


if __name__ == "__main__":
    examplequery()
```
Example R script:
```r
#!/usr/bin/Rscript
library(bigrquery)

# Define the scope for queries. Currently only read-only access is enabled.
bq_auth(scopes = "https://www.googleapis.com/auth/bigquery.readonly")

# Define the project running the queries == the user's own sandbox project.
projectid <- "fg-production-sandbox-6"

# Note that cohort 5 must first be generated in Atlas.
sql <- "SELECT person_source_value
        FROM `finngen-production-library.finngen_omop_result.cohort`
        LEFT JOIN `finngen-production-library.finngen_omop.person`
            ON cohort.subject_id = person.person_id
        WHERE cohort_definition_id = 5
        ORDER BY person_source_value
        LIMIT 20"

tb <- bq_project_query(projectid, sql)
df <- bq_table_download(tb)
print("20 first finngen ids from atlas cohort 5")
print(df[["person_source_value"]])
```