# BigQuery (relational database)

### Introduction

BigQuery is a database technology useful in storing and analyzing large and complex datasets like longitudinal data in FinnGen. This makes it easier for user to query, using SQL, subset of data much faster than reading entire dataset and then filter out. Queried data can be directly inserted into downstream analysis.

### Data

`finngen-production-library` project/database contains phenotype data for different release of FINNGEN data. The data in this project/database comes from **LIBRARY\_RED** and **LIBRARY\_GREEN** folders within [Sandbox IVM folders](https://finngen.gitbook.io/finngen-analyst-handbook/working-in-the-sandbox/running-analyses-in-sandbox/sandbox-path-and-pipeline-mappings).

The table below lists the available BigQuery tables. Additionally, to get the most recent list directly in Sandbox, type the below commands in RStudio. The list of tables will be visible under the "Connections" tab.

```
library(bigrquery)
projectid<-"fg-production-sandbox-nro" ##replace nro with your sandbox number

bq_con<-dbConnect(bigrquery::bigquery(),
                  project="finngen-production-library",
                  billing=projectid)
```

<table data-full-width="false"><thead><tr><th align="center">BigQuery Project/Database</th><th align="center">BigQuery Dataset/Schema</th><th align="center">BigQuery Table Name</th><th align="center">BigQuery Table Details</th><th align="center">Sandbox IVM folder</th><th align="center">Release Version</th></tr></thead><tbody><tr><td align="center">finngen-production-library</td><td align="center">sandbox_tools_r{<em>RELEASE</em>}</td><td align="center">finngen_r13_service_sector_detailed_longitudinal_v1</td><td align="center"><a href="/pages/9MhcSPoiHtwk4qo5gawD">Service Sector Detailed Longitudinal data</a></td><td align="center">LIBRARY_RED</td><td align="center">DF13</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">endpoint_cohorts_r13_v3</td><td align="center"><a href="/pages/56HeK0RR2xkNVAMxF0Nr">Endpoint cohorts</a></td><td align="center">LIBRARY_RED</td><td align="center">DF13</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">code_counts_r12_v1</td><td align="center">Code Counts</td><td align="center">LIBRARY_RED</td><td align="center">DF12</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">minimum_extended_r13_v1</td><td align="center"><a href="/pages/CNGd2sZcrdKpRvtywdFW">Phenotype Minimum Extended data</a></td><td align="center">LIBRARY_RED</td><td align="center">DF13</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">birth_mother_r13_v1</td><td align="center"><a href="/pages/iN91r6pjWbK9T79E9Ydx">Birth and DVV registry of Mother</a></td><td align="center">LIBRARY_RED</td><td align="center">DF12</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">vision_r13_v1</td><td align="center"><a href="/pages/h5nazcYtImIA7fOqQgJn">Visual impairment registry</a></td><td align="center">LIBRARY_RED</td><td align="center">DF13</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">kidney_r13_v1</td><td align="center"><a href="/pages/T4Dt07Fkrj1DP3gWH5IS">Kidney disease registry</a></td><td align="center">LIBRARY_RED</td><td align="center">DF12</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">kanta_r13_v1</td><td align="center"><a href="/pages/960WiPPbMcoM8lasQ4H5">Kanta Lab values registry</a></td><td align="center">LIBRARY_RED</td><td align="center">DF13</td></tr><tr><td align="center"></td><td align="center">medical_codes</td><td align="center">finngen_vnr_v2</td><td align="center"><a href="https://finngen.gitbook.io/finngen-analyst-handbook/finngen-data-specifics/finnish-health-registers-and-medical-coding/vnr-code-mapping-to-rxnorm">VNR information</a></td><td align="center">LIBRARY_GREEN</td><td align="center">INDEPENDENT</td></tr><tr><td align="center"></td><td align="center"></td><td align="center">fg_codes_info_v10</td><td align="center">Code translation info table</td><td align="center">LIBRARY_GREEN</td><td align="center"></td></tr><tr><td align="center"></td><td align="center">finngen_omop_r13</td><td align="center"></td><td align="center"><a href="https://ohdsi.github.io/CommonDataModel/cdm54.html#Clinical_Data_Tables">OMOP CDM tables</a></td><td align="center">LIBRARY_RED</td><td align="center">DF13</td></tr><tr><td align="center"></td><td align="center">finngen_results_r13</td><td align="center"></td><td align="center">Achilles tables</td><td align="center">LIBRARY_RED</td><td align="center">DF13</td></tr></tbody></table>

### bq command-line tool

It is also possible to access the BigQuery tables using [bq command-line tool](https://cloud.google.com/bigquery/docs/bq-command-line-tool) which is a Python-based command-line tool. [bq command-line tool](https://cloud.google.com/bigquery/docs/bq-command-line-tool) can be used to run queries.

Before running a query, users can check how much does it cost of running query using [bq command-line tool](https://cloud.google.com/bigquery/docs/bq-command-line-tool)

```
bq query --dry_run
```

The [dry run](https://www.google.com/url?q=https://cloud.google.com/bigquery/docs/best-practices-costs%23perform_dry_runs\&sa=D\&source=docs\&ust=1695633973100701\&usg=AOvVaw0WC6b1CshQXHVoivHICAbZ) will not execute the query but rather gives out how much data it consumes. In general, queries are priced using on-demand with estimate of $6.25 per Tebibyte (TiB). More details on pricing structure can be found [here](https://cloud.google.com/bigquery/pricing).

### Google Cloud BigQuery python and R drivers

It is also possible to use google cloud BigQuery python and R drivers to access data directly from IVM.

See a tutorial video about how to conduct BigQuery using Python and R scripts from [Users' meeting recordings](https://www.finngen.fi/en/members/recordings/finngen-data-users-meeting-22nd-nov-2022) (at 30min 2sec).

### Use cases

Links on how to connect to BigQuery in R and Python along with some use cases for downstream analysis

* Python
  1. [Connection](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/tool-to-annotate-variants-with-rsids-1.md)
  2. [Active Ingredient - Bar plot](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/tool-to-annotate-variants-with-rsids-1/bigquery-python-downstream-analysis-active-ingredient-bar-plot.md)
  3. [Sex difference - Pyramid plot](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/tool-to-annotate-variants-with-rsids-1/bigquery-python-case-study-sex-different-tornado-plot.md)
  4. [Comorbidity - Upset plot](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/tool-to-annotate-variants-with-rsids-1/bigquery-python-case-study-comorbidity-upset-plot.md)
  5. [Patient Timeline - Scatter plot](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/tool-to-annotate-variants-with-rsids-1/bigquery-python-case-study-patient-timeline-scatter-plot.md)
* R
  1. [Connection](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/bigquery-connection-r.md)
  2. [All Register data of a person](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/bigquery-connection-r/case-study-all-register-data-for-a-person.md)
  3. [Upset plot](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/bigquery-connection-r/case-study-upset-plot.md)
  4. [Tornado plot](/working-in-the-sandbox/which-tools-are-available/miscellaneous-helper-scripts-tools/bigquery-connection-r/case-study-tornado-plot.md)

#### Example Python script:

```
from google.cloud import bigquery
def examplequery():
# Define project running queries == users own sandbox project, 
# by default it matches your SB environment.
# see Sandbox no from Sandbox IVM desktop;buckets.txt

client = bigquery.Client()

# Tables must be defined in format project.dataset.table, 
# note that cohort 5 must be generated in Atlas.

query_job = client.query(
 """
 SELECT person_source_value
 FROM finngen-production-library.finngen_omop_result.cohort
 LEFT JOIN finngen-production-library.finngen_omop.person ON cohort.subject_id = person.person_id
 WHERE cohort_definition_id = 5
 ORDER BY person_source_value
 LIMIT 20
 """
 )
 results = query_job.result()
 print("20 first finngen ids from atlas cohort 5")
 for row in results:
 print("{}".format(row.person_source_value))
if __name__ == "__main__":
 examplequery()
```

#### Example R script:

```
#!/usr/bin/Rscript
library(bigrquery)

# Define scope for queries. Currently only readonly is enabled
bq_auth(scopes="https://www.googleapis.com/auth/bigquery.")

# Define project running queries == users own sandbox project
projectid = "fg-production-sandbox-6"

#note that cohort 5 must be generated in Atlas
sql <- "SELECT person_source_value
 FROM finngen-production-library.finngen_omop_result.cohort
 LEFT JOIN finngen-production-library.finngen_omop.person ON cohort.subject_id = person.person_id
 WHERE cohort_definition_id = 5
 ORDER BY person_source_value
 LIMIT 20"
tb <- bq_project_query(projectid, sql)
df <- bq_table_download(tb)
print("20 first finngen ids from atlas cohort 5")
print(df[["person_source_value"]])
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.finngen.fi/working-in-the-sandbox/which-tools-are-available/bigquery-relational-database.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
