BigQuery (relational database)
Introduction
BigQuery is a database technology for storing and analyzing large and complex datasets, such as the longitudinal data in FinnGen. With SQL, users can query just the subset of data they need, which is much faster than reading the entire dataset and filtering it afterwards. Query results can be fed directly into downstream analyses.
Data
finngen-production-library
This project/database contains phenotype data for the different releases of FinnGen data. The data in this project/database comes from the LIBRARY_RED and LIBRARY_GREEN folders within the Sandbox IVM.
| Dataset | Table | Description | Sandbox folder | Data freeze |
| --- | --- | --- | --- | --- |
| sandbox_tools_r13 | finngen_r13_service_sector_detailed_longitudinal_v1 | Service sector detailed longitudinal data | LIBRARY_RED | DF13 |
| sandbox_tools_r13 | endpoint_cohorts_r13_v1 | Endpoint cohorts | LIBRARY_RED | DF13 |
| sandbox_tools_r13 | code_counts_r12_v1 | Code counts | LIBRARY_RED | DF12 |
| sandbox_tools_r13 | fg_codes_info_v8 | Code translation info table | LIBRARY_GREEN | |
| finngen_results_r13 | | Achilles tables | LIBRARY_RED | DF13 |
Superset
Superset is a web tool for exploring the FinnGen data stored in the BigQuery database. Users can run SQL queries against the data and view the results as visualizations. Superset is accessed from the IVM Applications menu (Applications > FinnGen > Superset).
In the Explore tab, users can build different visualizations of the data using the graphical interface. In the SQL Lab tab, direct SQL queries can be run against the data.
Take a look at how to export FINNGENIDs from Atlas using Superset.
How to save results
Results can be saved into a Sandbox-specific BQ database named “Sandbox” (see the available databases in the Superset Database tab). Users can import small custom tables into the “Sandbox” BQ database and export result tables to the Sandbox IVM as CSV files.
Superset demo
See a Superset tutorial video from the FinnGen Users' Meeting of 22nd September 2020 (at 50 min).
More information:
bq command-line tool
It is also possible to access the BigQuery tables with the bq command-line tool, a Python-based command-line interface for running queries.
Before running a query, users can check how much it would cost by doing a dry run with the bq command-line tool, for example (the table here is one of those listed in the Data section; any accessible table works):
bq query --use_legacy_sql=false --dry_run 'SELECT COUNT(*) FROM `finngen-production-library.sandbox_tools_r13.endpoint_cohorts_r13_v1`'
A dry run does not execute the query; instead it reports how much data the query would process. Queries are billed on demand at an estimated $6.25 per tebibyte (TiB) processed. More details on the pricing structure can be found here.
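The pricing arithmetic behind that estimate can be sketched in Python. The helper below is a hypothetical illustration (not part of any FinnGen tooling); feed it the byte count reported by the dry run:

```python
def estimate_query_cost_usd(bytes_processed: int, usd_per_tib: float = 6.25) -> float:
    """Estimate the on-demand cost of a query from the bytes it would scan.

    Hypothetical helper: 1 TiB = 2**40 bytes, billed at usd_per_tib.
    """
    TIB = 2 ** 40
    return bytes_processed / TIB * usd_per_tib

# A query scanning a full TiB costs the headline rate:
print(estimate_query_cost_usd(2 ** 40))                  # 6.25
# A 10 GiB scan is far cheaper:
print(round(estimate_query_cost_usd(10 * 2 ** 30), 4))   # 0.061
```

In other words, a typical query touching only a few gigabytes costs a few cents, which is why filtering to the needed subset matters.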
Google Cloud BigQuery python and R drivers
It is also possible to use the Google Cloud BigQuery Python and R client libraries to access the data directly from the IVM.
See a tutorial video on querying BigQuery with Python and R scripts in the Users' Meeting recordings (at 30 min 2 s).
Use cases
Links on how to connect to BigQuery from R and Python, along with some use cases for downstream analysis:
Example Python script:
```python
from google.cloud import bigquery


def examplequery():
    # The project running the queries is the user's own sandbox project;
    # by default the client picks it up from your SB environment
    # (see the Sandbox number in the Sandbox IVM desktop file buckets.txt).
    client = bigquery.Client()
    # Tables must be referenced in the format project.dataset.table;
    # note that cohort 5 must first be generated in Atlas.
    query_job = client.query(
        """
        SELECT person_source_value
        FROM `finngen-production-library.finngen_omop_result.cohort`
        LEFT JOIN `finngen-production-library.finngen_omop.person`
            ON cohort.subject_id = person.person_id
        WHERE cohort_definition_id = 5
        ORDER BY person_source_value
        LIMIT 20
        """
    )
    results = query_job.result()
    print("20 first finngen ids from atlas cohort 5")
    for row in results:
        print(row.person_source_value)


if __name__ == "__main__":
    examplequery()
```
Example R script:
```r
#!/usr/bin/Rscript
library(bigrquery)

# Define the scope for queries. Currently only read-only access is enabled.
bq_auth(scopes = "https://www.googleapis.com/auth/bigquery.readonly")

# Define the project running the queries == the user's own sandbox project.
projectid <- "fg-production-sandbox-6"

# Note that cohort 5 must first be generated in Atlas.
sql <- "SELECT person_source_value
        FROM `finngen-production-library.finngen_omop_result.cohort`
        LEFT JOIN `finngen-production-library.finngen_omop.person`
            ON cohort.subject_id = person.person_id
        WHERE cohort_definition_id = 5
        ORDER BY person_source_value
        LIMIT 20"

tb <- bq_project_query(projectid, sql)
df <- bq_table_download(tb)
print("20 first finngen ids from atlas cohort 5")
print(df[["person_source_value"]])
```