BigQuery (relational database)
Last updated
Was this helpful?
Last updated
Was this helpful?
BigQuery is a database technology useful in storing and analyzing large and complex datasets like longitudinal data in FinnGen. This makes it easier for user to query, using SQL, subset of data much faster than reading entire dataset and then filter out. Queried data can be directly inserted into downstream analysis.
finngen-production-library
project/database contains phenotype data for different release of FINNGEN data. The data in this project/database comes from LIBRARY_RED and LIBRARY_GREEN folders within .
finngen-production-library
sandbox_tools_r12
finngen_r12_service_sector_detailed_longitudinal_v1
LIBRARY_RED
DF12
endpoint_cohorts_r12_v1
Endpoint cohorts
LIBRARY_RED
DF12
code_counts_r12_v1
Code Counts
LIBRARY_RED
DF12
minimum_extended_r12_v1
LIBRARY_RED
DF12
birth_mother_r12_v1
LIBRARY_RED
DF12
vision_r12_v1
LIBRARY_RED
DF12
kidney_r12_v1
LIBRARY_RED
DF12
kanta_r12_v1
LIBRARY_RED
DF12
medical_codes
finngen_vnr_v1
LIBRARY_GREEN
INDEPENDENT
fg_codes_info_v6
Code translation info table
LIBRARY_GREEN
finngen_omop_r12
LIBRARY_RED
DF12
finngen_results_r12
Achilles tables
LIBRARY_RED
DF12
Superset is a web tool that can be used to explore FinnGen data stored in BiqQuery database. Users can do SQL queries to the data and see the results as a data visualization. Superset is accessed from IVM Applications menu (Applications>FinnGen>Superset).
In Explore tab the user can do different visualizations to the data using the graphical interface. In the SQL Lab tab direct SQL queries can be done to the data.
It is possible to save results into Sandbox specific BQ database named “Sandbox” (see available databases from Superset Database tab). Users can import small custom tables BQ database “Sandbox” and export result tables to Sandbox IVM as a csv file.
It is also possible to use google cloud BigQuery python and R drivers to access data directly from IVM.
Links on how to connect to BigQuery in R and Python along with some use cases for downstream analysis
Python
R
Take a look .
See a Superset tutorial video from (at 50min).
It is also possible to access the BigQuery tables using which is a Python-based command-line tool. can be used to run queries.
Before running a query, users can check how much does it cost of running query using
The will not execute the query but rather gives out how much data it consumes. In general, queries are priced using on-demand with estimate of $6.25 per Tebibyte (TiB). More details on pricing structure can be found .
See a tutorial video about how to conduct BigQuery using Python and R scripts from (at 30min 2sec).