# Extracting minimum phenotype data per biobank

### **Auria Biobank**

30.3.2022

**Data source**: TYKS Datalake (EHR)

**Longitudinal data reported**: no

**Gender**: male/female, extracted from the personal identity code

**Age**: years, at the time of sample collection; calculated using the date of birth (extracted from the personal identity code) and the date of sample collection

**Height**: cm, extracted from structured EHR data (“hoitotaulukko”)

**Date(height)**: date of the height measurement

**Weight**: kg, extracted from structured EHR data (“hoitotaulukko”)

**Date(weight)**: date of the weight measurement

**Smoking status**: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms

**Date(smoking)**: date of the smoking status

**Note:** Text mining of the smoking status was developed by the data analysis team of Auria Biobank. The method is based on classification rules. Its accuracy has been shown to be about 80-90% among patients whose medical reports contain smoking information. About 35% of the patients have no such information and receive the value NA.
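The derivation of gender and age from the personal identity code, as described in the fields above, can be illustrated with a minimal Python sketch. It assumes the standard layout of the Finnish personal identity code (`DDMMYYCZZZQ`, where `C` is the century sign and an odd individual number `ZZZ` denotes male), and it assumes that age is reported in completed years, which the source does not state. The function names are hypothetical, the check character is not validated, and this is not the biobank's actual implementation.

```python
from datetime import date
from typing import Tuple

# Century signs of the Finnish personal identity code (DDMMYYCZZZQ),
# including the additional signs introduced in 2023.
CENTURY_BY_SIGN = {"+": 1800}
CENTURY_BY_SIGN.update({sign: 1900 for sign in "-YXWVU"})
CENTURY_BY_SIGN.update({sign: 2000 for sign in "ABCDEF"})


def parse_identity_code(code: str) -> Tuple[date, str]:
    """Return (date of birth, gender) for a code such as '131052-308T'.

    The check character (last position) is not validated in this sketch.
    """
    if len(code) != 11:
        raise ValueError("expected an 11-character identity code")
    day, month, year = int(code[0:2]), int(code[2:4]), int(code[4:6])
    century = CENTURY_BY_SIGN[code[6]]
    individual_number = int(code[7:10])
    # Odd individual numbers are assigned to males, even ones to females.
    gender = "male" if individual_number % 2 == 1 else "female"
    return date(century + year, month, day), gender


def age_in_years(date_of_birth: date, sample_date: date) -> int:
    """Completed years between birth and sample collection (assumption)."""
    had_birthday = (sample_date.month, sample_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return sample_date.year - date_of_birth.year - (0 if had_birthday else 1)
```

For the sample code `131052-308T`, `parse_identity_code` gives a date of birth of 13.10.1952 and gender `female`; with a sample collected on 30.3.2022, `age_in_years` gives 69.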

### **Biobank of Eastern Finland**

18.3.2022

**Data source**: KYS Datalake (EHR), CORE consent management system

**Longitudinal data reported**: no

**Gender**: male/female, extracted from the personal identity code

**Age**: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

**Height**: cm, extracted from structured EHR data

**Date(height)**: date of the height measurement

**Weight**: kg, extracted from structured EHR data

**Date(weight)**: date of the weight measurement

**Smoking status**: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms

**Date(smoking)**: date of the smoking status

**Note:** -

### **FRC Blood Service Biobank**

4.6.2022

**Data source**: Questionnaire in connection with the biobank consent

**Longitudinal data reported**: no

**Gender**: male/female, extracted from the personal identity code

**Age**: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

**Height**: cm, self-reported

**Date(height)**: date of the biobank consent

**Weight**: kg, self-reported

**Date(weight)**: date of the biobank consent

**Smoking status**: regular smoker (years) / irregular smoker (years) / former smoker (years) / never smoker, self-reported

**Date(smoking)**: date of the biobank consent

**Note:** -

### **Central Finland Biobank**

18.3.2022

**Data source**: Central Finland Health Care District EHR databases

**Longitudinal data reported**: no

**Gender**: male/female, extracted from the personal identity code

**Age**: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

**Height**: cm, extracted from structured EHR data

**Date(height)**: date of the height measurement

**Weight**: kg, extracted from structured EHR data

**Date(weight)**: date of the weight measurement

**Smoking status**: current smoker / previous smoker / non-smoker / never smoked / NA, extracted from structured EHR data

**Date(smoking)**: date of the smoking status

**Note**: Height, weight, and smoking status are constructed from several health records, which contain this information in structured format. Height and weight information are retrieved for about 25% of sample donors. Smoking status is retrieved for about 50% of sample donors. Data mining is not currently used.

### **Finnish Clinical Biobank Tampere**

17.3.2022

**Data source**: PSHP Datalake (EHR)

**Longitudinal data reported**: no

**Gender**: male/female, extracted from structured EHR data

**Age**: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

**Height**: cm, extracted from structured EHR data

**Date(height)**: date of the height measurement; the measurement closest to the sampling time is used

**Weight**: kg, extracted from structured EHR data

**Date(weight)**: date of the weight measurement; the measurement closest to the sampling time is used

**Smoking status**: NA

**Note:** -
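Selecting the height or weight measurement nearest to the sampling time, as described for Date(height) and Date(weight) above, can be sketched as follows. This is an illustrative Python example only: the function name, the `(date, value)` tuple layout, and the tie-breaking rule (the earlier measurement wins) are assumptions, not details given by the biobank.

```python
from datetime import date
from typing import List, Optional, Tuple

Measurement = Tuple[date, float]  # (measurement date, value)


def closest_measurement(
    measurements: List[Measurement], sample_date: date
) -> Optional[Measurement]:
    """Return the measurement whose date is nearest to sample_date.

    Ties are broken in favour of the earlier measurement (an assumption).
    Returns None when there are no measurements.
    """
    if not measurements:
        return None
    return min(
        measurements,
        key=lambda m: (abs((m[0] - sample_date).days), m[0]),
    )


heights = [
    (date(2015, 1, 10), 180.0),
    (date(2019, 6, 1), 179.5),
    (date(2021, 2, 3), 179.0),
]
selected = closest_measurement(heights, date(2019, 9, 15))
```

Here `selected` is the measurement from 1.6.2019, and both the value and its date would be reported.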

### **Helsinki Biobank**

11.3.2022

**Data source**: HUS Datalake (EHR)

**Longitudinal data reported**: no

**Gender**: male/female, extracted from the personal identity code

**Age**: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

**Height**: cm, extracted from structured EHR data (for DF11 and beyond also extracted from medical reports by text mining algorithms)

**Date(height)**: date of the height measurement

**Weight**: kg, extracted from structured EHR data (for DF11 and beyond also extracted from medical reports by text mining algorithms)

**Date(weight)**: date of the weight measurement

**Smoking status**: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms

**Date(smoking)**: date of the smoking status

**Note:** Text mining of the smoking status was based on FinBERT (<https://github.com/TurkuNLP/FinBERT>), a Finnish-language version of Google's BERT deep transfer learning model. The smoking status classifier was evaluated on a set of 947 patients; the obtained accuracy and F-score were both 94.5%. More specifically, the F-scores for smoker, ex-smoker, non-smoker, and NA were 92.2%, 91.8%, 98.0%, and 73.7%, respectively. The evaluation set shared no patients with the development set.
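To make the evaluation metrics in the note concrete, the following Python sketch computes accuracy and per-class F-scores for a four-class smoking status classifier. The source does not say how the overall F-score was averaged; one reading consistent with accuracy and F-score coinciding is micro-averaging, which for single-label classification always equals accuracy. The labels and predictions below are toy data invented for illustration, not the Helsinki Biobank evaluation set.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    errors = len(y_true) - tp  # each error is one FP and one FN
    return 2 * tp / (2 * tp + 2 * errors)


# Toy data (hypothetical), one entry per patient.
y_true = ["smoker", "smoker", "ex-smoker", "non-smoker", "non-smoker", "NA"]
y_pred = ["smoker", "ex-smoker", "ex-smoker", "non-smoker", "non-smoker", "non-smoker"]

f1_by_class = per_class_f1(y_true, y_pred)
overall_accuracy = accuracy(y_true, y_pred)
overall_micro_f1 = micro_f1(y_true, y_pred)
```

On the toy data, `overall_accuracy` and `overall_micro_f1` are equal, while the per-class scores differ, with the rare NA class scoring lowest.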

### **Northern Finland Biobank Borealis**

25.3.2022

**Data source**: BC Platforms, Esko Systems

**Longitudinal data reported**: no

**Gender**: male/female, extracted from the personal identity code

**Age**: years, at the time of sample collection; calculated using the date of birth and the date of sample collection (BC Platforms)

**Height**: cm, extracted manually from Esko patient information system.

**Date(height):** measurement closest to the blood sampling time; exact dates not recorded in the minimum data set file.

**Weight**: kg, extracted manually from Esko patient information system.

**Date(weight):** measurement closest to the blood sampling time; exact dates not recorded in the minimum data set file.

**Smoking status**: current smoker / former smoker / never smoker / NA, extracted manually from Esko patient information system.

**Date(smoking):** most recent recording in the EHR or recording closest to the blood sampling time (if available); exact dates not recorded in the minimum data set file.

**Note:** Extracting data from Esko patient information system has to be done manually one person at a time and the process is very laborious. Currently, Borealis does not have access to patient information in a structured format extractable from a database.

### **Terveystalo Biobank**

Terveystalo’s biobank samples and registries were transferred to Auria Biobank in 2025.

30.3.2022

**Data source**: Terveystalo Datalake (EHR)

**Longitudinal data reported**: no

**Gender**: male/female, extracted from the personal identity code

**Age**: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

**Height**: cm, extracted from structured EHR data

**Date(height)**: date of the height measurement

**Weight**: kg, extracted from structured EHR data

**Date(weight)**: date of the weight measurement

**Smoking status**: current smoker / non-smoker / NA, extracted from structured EHR data

**Date(smoking)**: date of the smoking status

**Note:** -

### **THL Biobank**

14.4.2022

**Data source:** THL Biobank phenotype database PhenoWeb (and biobank cohort data files extracted from cohort databases)

**Longitudinal data reported:** no

**Gender:** male/female. Mainly extracted from biobank cohort data. For a subset extracted from the personal identity code.

**Age:** years, at the time of sample collection; either directly obtained from the biobank cohort data or calculated using the date of birth and the date of sample collection

**Height:** cm, extracted from biobank database or cohort data files. Transformed from m into cm if needed.

**Date(height):** date of the height measurement. Reported only for cohorts where the sampling date did not match the height measurement date; otherwise the same as the sampling date.

**Weight:** kg, extracted from biobank database or cohort data files.

**Date(weight):** date of the weight measurement. Reported only for cohorts where the sampling date did not match the weight measurement date; otherwise the same as the sampling date.

**Smoking status:** The smoking-related attributes have been extracted from the biobank database and/or cohort data files and released for the project. For each cohort, the smoking data has been delivered to the extent that it has been transferred to THL Biobank. Different cohorts have different attributes, and some cohorts do not include smoking attributes at all. The broad smoking data available from THL Biobank has then been harmonized and used in FinnGen to cover smoking status in more detail.

**Date(smoking):** date of the smoking data collection. Reported only for cohorts where the sampling date did not match the smoking data collection date; otherwise the same as the sampling date.

**Note:** The minimum datasets were extracted one cohort at a time, and the protocol varied slightly depending on the cohort at hand.
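The height unit conversion and the date fallback described in the fields above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the unit of each cohort is assumed to be known and passed in explicitly, and the actual per-cohort protocol is not documented here.

```python
from datetime import date
from typing import Optional


def height_in_cm(value: float, unit: str) -> float:
    """Convert a height to centimetres; unit must be 'cm' or 'm'."""
    if unit == "cm":
        return value
    if unit == "m":
        # Round to one decimal to avoid floating-point noise.
        return round(value * 100, 1)
    raise ValueError(f"unsupported unit: {unit!r}")


def effective_date(measurement_date: Optional[date], sample_date: date) -> date:
    """Use the measurement date when it is reported, else the sampling date."""
    return measurement_date if measurement_date is not None else sample_date
```

For example, `height_in_cm(1.75, "m")` gives `175.0`, and `effective_date(None, date(2007, 5, 2))` falls back to the sampling date.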

