Data

Data location

Sandbox

  • /finngen/library-red/finngen_R13/kanta_lab_1.0/data/finngen_R13_kanta_lab_1.0.txt.gz in textual TSV-gzipped format (for use with awk, grep, UNIX piping)

  • /finngen/library-red/finngen_R13/kanta_lab_1.0/data/finngen_R13_kanta_lab_1.0.parquet in binary Parquet format (for use with Python pandas, R data.frame)
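As a minimal sketch of working with the TSV-gzipped file from Python without loading it fully into memory, the function below streams rows and optionally filters them. It assumes the file is UTF-8, has a header row with the column names listed under "Data columns", and uses tabs without quoting; `iter_rows` is a hypothetical helper name, not part of any FinnGen tooling.

```python
import csv
import gzip


def iter_rows(path, test_name=None, statuses=None):
    """Yield rows (as dicts) from a gzipped TSV, optionally filtered.

    Assumes a header row with columns such as TEST_NAME and
    MEASUREMENT_STATUS (an assumption about the file layout).
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if test_name is not None and row["TEST_NAME"] != test_name:
                continue
            if statuses is not None and row["MEASUREMENT_STATUS"] not in statuses:
                continue
            yield row
```

For example, `iter_rows(path, test_name="p-alat", statuses={"F", "C"})` would keep only final or corrected results for that test name. For the Parquet file, a columnar reader such as pandas is likely the more convenient choice.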

BigQuery

Available in this table: finngen-production-library.sandbox_tools_r12.kanta_r12_v1

Data columns

N.B. The raw data contains a MEASUREMENT_FREE_TEXT column that cannot be released directly because it may contain sensitive data. It holds a mix of numerical measurement values, positive/negative outcomes, outcomes expressed against thresholds (e.g. <3ml) and general notes. Our approach has been to extract the shareable parts of this column through a process of cleaning and whitelisting.

Overview

Column: Description

  • FINNGENID: Study ID (Pseudonymised ID given to the FinnGen participant)

  • SEX: Sex of the individual, female or male

  • EVENT_AGE: Age (in years) at time of event, e.g. 12.012

  • APPROX_EVENT_DATETIME: Date (randomized) and time (not randomized) of event, e.g. 2020-01-02T07:30 (see details)

  • TEST_NAME: Short name of the lab test, e.g. p-alat, s-tsh

  • TEST_ID: Code of the lab test (national or local lab test ID)

  • TEST_ID_IS_NATIONAL: Whether or not the TEST_ID is using the national lab test code system

  • OMOP_CONCEPT_ID: OMOP Concept ID mapped from the TEST_ID and MEASUREMENT_UNIT

  • MEASUREMENT_VALUE: Value of the test measurement

  • MEASUREMENT_UNIT: Corresponding unit for the test measurement

  • MEASUREMENT_VALUE_HARMONIZED: Value of the test measurement, after harmonization across the OMOP Concept ID

  • MEASUREMENT_UNIT_HARMONIZED: Corresponding unit for the harmonized measurement value

  • TEST_OUTCOME: Label given for the outcome of the test to indicate how it falls against the reference range (see value table)

  • TEST_OUTCOME_IMPUTED: Imputed test outcome (see value table)

  • MEASUREMENT_STATUS: Code indicating the status of the lab test measurement (see value table)

  • REFERENCE_RANGE_GROUP: Reference range for this event, as text

  • REFERENCE_RANGE_LOW_VALUE: Value for the low bound of the reference range

  • REFERENCE_RANGE_LOW_UNIT: Corresponding unit for the low bound of the reference range

  • REFERENCE_RANGE_HIGH_VALUE: Value for the high bound of the reference range

  • REFERENCE_RANGE_HIGH_UNIT: Corresponding unit for the high bound of the reference range

  • CODING_SYSTEM_ORG: Derived from CODING_SYSTEM_OID

  • CODING_SYSTEM_OID: Original name: tutkimuskoodistonjarjestelmaid

  • TEST_ID_SOURCE: Code of the lab test, as it appeared before preprocessing of the data

  • TEST_NAME_SOURCE: Short name of the lab test, as it appeared before preprocessing of the data

  • MEASUREMENT_VALUE_SOURCE: Value of the test measurement, as it appeared before data cleaning

  • MEASUREMENT_UNIT_SOURCE: Unit of the test measurement, as it appeared before data cleaning

  • MEASUREMENT_VALUE_EXTRACTED: Value of the test measurement extracted from the MEASUREMENT_FREE_TEXT column

  • MEASUREMENT_VALUE_MERGED: Harmonized and extracted values merged together

  • OUTCOME_POS_EXTRACTED: 1 (pos) or 0 (neg) outcome extracted from the MEASUREMENT_FREE_TEXT column

  • TEST_OUTCOME_TEXT_EXTRACTED: [<|>]|[VALUE]|[UNIT?] extracted from the MEASUREMENT_FREE_TEXT column

TEST_OUTCOME

This column provides a label comparing the measured value against a reference range.

Value: Description

  • N: Normal

  • A: Abnormal

  • AA: Very abnormal

  • L: Low

  • LL: Very low

  • H: High

  • HH: Very high

TEST_OUTCOME_IMPUTED

Some rows are missing the TEST_OUTCOME, so an imputed one is provided. TEST_OUTCOME_IMPUTED is derived from the data for the same OMOP Concept ID, provided that at least 100 entries have both a MEASUREMENT_VALUE and a TEST_OUTCOME. The thresholds are determined as follows. Entries with both a (harmonized) measurement value and an outcome are sorted by value, so that the outcome labels follow the same order. E.g.

Value   OUTCOME
1       L
1       L
1.3     N
...     ...
7       N
14      H
15      H

Starting from the lower end, we expect to find mostly low (L) values and then gradually more normal (N) ones. To find the turnover point where Ns become the majority, we define a relative measure: the number of L entries divided by all entries seen so far. In the ideal scenario this value starts at 100% and gradually declines as more Ns (or other labels, like A and H) appear. We set the threshold at the point where the relative measure drops under 95% for the last time. The same is done from the opposite side with H. A summary of the thresholds can be found in the repo. In the process we found two kinds of anomalies, mainly due to an asymmetric distribution of labels:

  • ±inf thresholds. In these cases not enough labels are present at the tails of the distribution. The algorithm initializes the thresholds to ±inf, and they never get updated because the ratio of labels never climbs above the 95% threshold to begin with. This is usually associated with lab values where there is no such thing as L/H (e.g. Triglycerides) or where the labels used are A instead of H|L

  • PROBLEM column. This boolean column flags the opposite issue: we traverse the whole list of values up to the median while still being above the 95% threshold, so the median value is used as the threshold. This indicates a heavy bias in the distribution of outcome labels, so one should proceed with caution. Values imputed with these thresholds are labelled with a *, e.g. L* or H*
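The low-side threshold search described above can be sketched roughly as follows. This is an illustrative reading of the text, not the pipeline's actual code: the function name `low_threshold`, the choice to stop at the middle entry of the sorted list, and the tie handling are all assumptions. The high side would mirror it, walking from the largest value downwards and counting H labels.

```python
import math


def low_threshold(pairs, min_share=0.95):
    """Return (threshold, problem) for the low side.

    pairs: iterable of (harmonized_value, outcome_label).
    The threshold starts at -inf and is moved to the current value each
    time the running share of "L" labels is at least min_share, so it ends
    at the last point before the share drops under min_share for good.
    problem is True when the share is still >= min_share at the median.
    """
    ordered = sorted(pairs, key=lambda pair: pair[0])
    lower_half = ordered[: (len(ordered) + 1) // 2]  # walk only up to the median
    threshold = -math.inf
    n_low = 0
    share = 0.0
    for seen, (value, label) in enumerate(lower_half, start=1):
        if label == "L":
            n_low += 1
        share = n_low / seen
        if share >= min_share:
            threshold = value
    problem = bool(lower_half) and share >= min_share
    return threshold, problem
```

With the example values listed earlier (1 L, 1 L, 1.3 N, 7 N, 14 H, 15 H) this sketch returns a threshold of 1 with no problem flag. A list without any L labels leaves the threshold at -inf, and a list made only of L labels sets the problem flag, which matches the two anomalies described above.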

The TEST_OUTCOME_IMPUTED column takes the following values:

Value: Description

  • N: Imputed Normal

  • L: Imputed Low

  • L*: Imputed Low. Less confidence in the imputation due to over-representation of L and H from TEST_OUTCOME

  • H: Imputed High

  • H*: Imputed High. Less confidence in the imputation due to over-representation of L and H from TEST_OUTCOME

MEASUREMENT_STATUS

Value: Description

  • C: Corrected result

  • F: Final result

  • R: Unverified result

  • S: Partial result

Pipeline

The pipeline is available on GitHub (https://github.com/FINNGEN/kanta_lab_preprocessing/), together with technical information on how the raw data was processed.

A quick summary:

  • duplicate entries are removed (based on ID, date, lab test name, lab test code, measurement status and value)

  • text is processed to remove spaces and strange characters

  • national test codes are mapped to names based on known mappings

  • units are cleaned, made uniform and mapped to OMOP based on the lab test

  • units are harmonized based on OMOP IDs

  • a second duplicate-removal step takes place after harmonization to catch duplicate entries from different systems (checking ID, date, harmonized test name, value and status)
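The two duplicate-removal steps can be pictured as keeping the first row for each combination of key columns. The sketch below is only an illustration: `drop_duplicates` is a hypothetical helper, and the mapping of "ID, date, lab test name/code, status and value" onto the released column names is an assumption, since the pipeline works on the raw column names.

```python
# Assumed key columns for the two passes (released column names used for illustration).
FIRST_PASS_KEY = (
    "FINNGENID", "APPROX_EVENT_DATETIME", "TEST_NAME",
    "TEST_ID", "MEASUREMENT_STATUS", "MEASUREMENT_VALUE",
)
SECOND_PASS_KEY = (
    "FINNGENID", "APPROX_EVENT_DATETIME", "TEST_NAME",
    "MEASUREMENT_VALUE_HARMONIZED", "MEASUREMENT_STATUS",
)


def drop_duplicates(rows, key_columns):
    """Keep the first row (a dict) seen for each combination of key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(column) for column in key_columns)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique
```

Running the second pass after harmonization matters because two source systems may report the same measurement with different raw names or units that only become identical once harmonized.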

Additional dataset: Values extraction analysis

In addition to the core Kanta data, a second dataset is released for analysis. This file focuses on numerical values and on manipulating or removing entries for downstream analysis, for example by flagging or removing problematic values. It lacks some columns from the original data but contains new ones for analysis purposes. This keeps the pure data munging/harmonization separate from the numerical processing done for analysis.

Summary

The pipeline is available on GitHub (https://github.com/FINNGEN/kanta_lab_preprocessing/), together with technical information on how the raw data was processed.

A quick summary:

  • the MEASUREMENT_FREE_TEXT column is manipulated to extract shareable information

    • Where the original measurement value is missing and the free text is available, we attempt to extract numerical values from the text if they match certain patterns. If, after some string manipulation, we are left with a pure number, it is cast from string to float and used to populate the MEASUREMENT_VALUE_EXTRACTED column

    • The text is scanned for pos/neg substrings and through a manual mapping, values are mapped to 1 (pos) or 0 (neg) in a new OUTCOME_POS_EXTRACTED column

    • The text is scanned to look for entries that indicate outcome as a comparison and are structured as such:

      • comparison (Yli/alle/</>)

      • numerical value

      • unit (potentially missing)

      These entries are manipulated in order to be standardized following the format [<|>]|[VALUE]|[UNIT?] so they can be shared safely.

  • a QC step removes extracted values that are formatted as dates
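To make the free-text extraction steps more concrete, here is a rough sketch of the numeric extraction and of the comparison standardization. The regular expressions, the comparator mapping and the function names are assumptions made for illustration; the real patterns and whitelists live in the pipeline repository. Emitting an empty unit after the last separator when no unit is present is also an assumption.

```python
import re

# Assumed mapping from comparison words/symbols to the standardized symbol.
COMPARATORS = {"yli": ">", "alle": "<", "<": "<", ">": ">"}

# comparison, numerical value, optional unit (e.g. "alle 3 ml", "<3ml", "Yli 10,5")
COMPARISON = re.compile(
    r"^\s*(yli|alle|<|>)\s*(\d+(?:[.,]\d+)?)\s*([^\d\s]\S*)?\s*$",
    re.IGNORECASE,
)
NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")


def extract_number(text):
    """Return the text as a float if it is a pure number, else None."""
    cleaned = text.strip().replace(",", ".")
    if NUMBER.fullmatch(cleaned) is None:
        return None  # e.g. a date such as 12.1.2020 is rejected
    return float(cleaned)


def standardize_comparison(text):
    """Return '[<|>]|[VALUE]|[UNIT?]' for comparison-style text, else None."""
    match = COMPARISON.match(text)
    if match is None:
        return None
    comparator = COMPARATORS[match.group(1).lower()]
    value = match.group(2).replace(",", ".")
    unit = match.group(3) or ""
    return f"{comparator}|{value}|{unit}"
```

Under these assumptions, "alle 3 ml" and "<3ml" both become `<|3|ml`, while free text that does not fit the pattern yields `None` and is therefore not shared.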

Location

/finngen/library-red/finngen_R13/kanta_analysis_1.0/

As with the full munged data, there are two files:

  • finngen_R13_kanta_analysis_1.0.txt.gz

  • finngen_R13_kanta_analysis_1.0.parquet

Extraction Summary

In the following table one can find a summary of the free text extraction process.

Columns of the summary table:

  • OMOP: OMOP ID

  • N_EXTRACTED: N of extracted numerical values

  • %_EXTRACTED: Percentage of numerical values extracted

  • %_NA_MEASUREMENT: Percentage of extracted values that had NA in raw data measurement

  • N_POSNEG: N of extracted POS/NEG values

  • %_EXTRACTED (second occurrence): Percentage of POS/NEG extracted values

  • %_NA_OUTCOME: Percentage of extracted values that had NA in raw data outcome

  • conceptName: Concept Name

Rows (values in the column order above):

  • 3026361, 2095662, 22.6799, 100.0000, 2, 0.0000, 100.0000, Erythrocytes [#/volume] in Blood

  • 3018095, 118284, 22.3749, 100.0000, 67950, 12.8536, 6.2384, Leukocytes [#/volume] in Urine

Other reference tables

Test name abbreviations

Test name abbreviations come from different laboratory testing centers around Finland. Some are standardized nationally and some are used only locally in different hospitals and test centers.

We have put a lot of effort into standardizing these to international OHDSI OMOP Concept ID (primarily from LOINC) so we hope that you do not need to interpret them very often! However, in case you have reason to use them, we provide the meaning of most abbreviations here.

Prefixes for lab test name abbreviations

  • aB: Arterial blood

  • Af: Puncture fluid

  • aG: Alveolar gas

  • Am: Amniotic fluid

  • As: Ascitic fluid

  • B: Blood

  • Bf: Bronchus fluid

  • Bi: Bile

  • Bl: Bronchoalveolar lavage

  • Bm: Bone marrow

  • Bo: Bone

  • Br: Breast

  • Bu: Bursa

  • Ca: Cannula/IV port

  • cB: Capillary blood

  • Cf: Cervix fluid

  • Cn: Central nervous system

  • cU: Collected urine

  • Cv: Chorionic villus

  • Di: Dialysis fluid

  • Dj: Duodenal juice

  • dU: Diurnal urine

  • E: Erythrocyte

  • Ex: Sputum

  • F: Fecal

  • fB: Fasting blood

  • Fl: Vaginal fluor

  • fP: Fasting plasma

  • fS: Fasting serum

  • Gi: Gastrointestinal

  • Gj: Gastric juice

  • Hb: Hemoglobin

  • He: Heart

  • Ki: Kidney

  • L: Leukocytes

  • Lf: Lacrimal fluid

  • Li: Likvor/CSF

  • Ln: Lymph node

  • Lr: Liver

  • Lu: Lung

  • Ly: Lymphocytes

  • M: Muscle

  • mB: Machine blood

  • Me: Meconium

  • Mf: Mammary fluid

  • Mm: Maternal milk

  • Mu: Mucosa

  • Ne: Nerve

  • Ns: Nasal secretion

  • nU: Nocturnal urine

  • P: Plasma

  • Pd: Peritoneal dialysis

  • Pf: Pleura

  • Pi: Pituitary gland

  • Pl: Placenta

  • Pp: Periodontal pocket

  • Ps: Pharyngeal secretion

  • Pt: Patient

  • Pu: Pus

  • S: Serum

  • Sa: Saliva

  • Se: Secretion

  • Sk: Skin

  • Sp: Semen

  • Sw: Sweat

  • Sy: Synovial fluid

  • T: Thrombocyte

  • Ts: Tissue

  • Tu: Tumor

  • U: Urine

  • uA: Umbilical arterial blood

  • Ug: Urogenital

  • uS: Umbilical serum

  • uV: Umbilical venous blood

  • vB: Venous blood

  • W: Water
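If you do need to interpret a test name, a small lookup can split off the specimen prefix. The sketch below uses only a handful of the prefixes listed above and the example names that appear in this document (p-alat, s-tsh, ps-stravi, F-sienVi). The helper name `split_test_name` and the case-insensitive fallback are assumptions: the fallback is there because names in the data often appear in lowercase, and it could become ambiguous if the full prefix table were loaded.

```python
# A small subset of the prefix table; extend with further entries as needed.
PREFIXES = {
    "B": "Blood",
    "F": "Fecal",
    "fP": "Fasting plasma",
    "P": "Plasma",
    "Ps": "Pharyngeal secretion",
    "S": "Serum",
    "U": "Urine",
}
# Fallback lookup for names written in a different case (assumption).
_LOWERCASE_PREFIXES = {key.lower(): value for key, value in PREFIXES.items()}


def split_test_name(test_name):
    """Split 'prefix-rest' and return (prefix meaning or None, rest)."""
    prefix, separator, rest = test_name.partition("-")
    if not separator:
        return None, test_name  # no prefix present
    meaning = PREFIXES.get(prefix) or _LOWERCASE_PREFIXES.get(prefix.lower())
    return meaning, rest
```

For instance, `split_test_name("ps-stravi")` gives the specimen "Pharyngeal secretion" and leaves "stravi" for the suffix and analyte part of the name.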

Suffixes for lab test name abbreviations

  • -Ab: Antibody

  • -AbA: IgA antibody

  • -AbE: IgE antibody

  • -AbG: IgG antibody

  • -AbM: IgM antibody

  • -Ag: Antigen

  • -Akt: Activity

  • -Aktt: Activation products

  • -Cl: Clearance

  • -Ct: Control

  • -D: DNA

  • -Di: Dialysis

  • -EVi: Special culture

  • -EM: Electron microscopic

  • -F: Fetal

  • -Fc: Flow cytometry

  • -Fr: Fraction

  • -Gr: Gestational

  • -IF: Immunofluorescence

  • -IH: Immunohistochemistry

  • -Ind: Index

  • -Ion: Ionized

  • -Is: Isoenzymes

  • -ISH: In situ hybridisation

  • -Jtk: Follow-up study

  • -Jvi: Follow-up culture (jatkoviljely)

  • -Kj: Conjugate

  • -Lm: Species specificity

  • -MS: Mass spectrometry

  • -Nh: Nucleic acid

  • -O: Qualitative

  • -Oc: Oligoclonal

  • -Pa: Long term

  • -Pse: Screening and categorization

  • -PT: Rapid test

  • -R: Exercise stress test

  • -S: Stimulation

  • -Sc: Subclasses

  • -Ty: Typing

  • -V: Free or unconjugated

  • -Vi: Microbiology culture (e.g. u-Baktvi = bacterial culture from urine, ps-stravi = Strep A culture in pharyngeal secretion, F-sienVi = fungal culture in stool)

  • -Vit: Vitamin

  • -Vr: Staining

  • -Vt: Point of care (vieritesti), often a rapid test

Reference range terms

Test reference ranges are free-text strings that can contain a lot of Finnish. For those who don’t speak Finnish, we provide here translations of some of the common words you will see in reference ranges:

General terms:

  • AIKUISET: Adults

  • ALLE: Under/Below

  • ALK: Abbreviation for "alkaen", meaning "starting from" or "beginning at"

  • ALTISTUMATTOMAT: Unexposed (individuals)

  • AAMUNÄYTE: Morning sample

  • EDELLEEN: Still, continuing

  • FERTIILI-IKÄ: Fertile age

  • HOITOALUE: Treatment range

  • JA: And

  • JÄÄNNÖSPIT: Residual concentration

  • KAIKKI: All, everyone

  • KATSO: See, look at

  • KK: Abbreviation for "kuukausi", meaning month

  • KS: Abbreviation for "katso", meaning "see" or "look at"

  • KTS: Another abbreviation for "katso"

  • KYMENLAAKSONLAB: Kymenlaakso Laboratory (a specific lab in Finland)

  • LAPSET: Children

  • LEUK: Leukocytes (white blood cells)

  • LIER: Likely referring to "lieriöt", meaning casts (in urine analysis)

  • MIEHET: Men

  • NAISET: Women

  • NEGAT: Negative

  • NORMAALI: Normal

  • OHJEKIRJA: Manual, guidebook

  • PAASTO: Fasting

  • POJAT: Boys

  • POSTMENOPAUSSI: Postmenopausal

  • PREMENOPAUSAALISET: Premenopausal

  • PUBERT: Puberty

  • RASKAUS: Pregnancy

  • SUOSITELTAVA: Recommended

  • TAVOITE: Target, goal

  • TAVOITEARVO: Target value

  • TERAP: Therapeutic

  • TOKSINEN: Toxic

  • TULKINTA: Interpretation

  • TUPAKOIMATTOMAT: Non-smokers

  • TYTÖT: Girls

  • V: Abbreviation for "vuosi", meaning year

  • VASTASYNT: Newborn

  • VIITEARVO: Reference value

  • VKO: Abbreviation for "viikko", meaning week

  • VRK: Abbreviation for "vuorokausi", meaning day (24-hour period)

  • YLI: Over, above

Age-related terms:

  • 0-6PV: 0-6 days

  • 1KK-1V: 1 month to 1 year

  • 1V-: 1 year and older

  • 2-4V: 2-4 years

  • 5-10V: 5-10 years

  • 11-15V: 11-15 years

  • 16V-: 16 years and older

  • 18V-: 18 years and older
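Age tokens like the ones above follow a regular shape, so they can be parsed mechanically. The following sketch covers only the forms listed here (a lower bound, an optional upper bound, and the units PV, VKO, KK and V). The helper name `parse_age_range` is hypothetical, and applying the upper bound's unit to a unit-less lower bound (as in 2-4V) is an assumption.

```python
import re

# Unit abbreviations taken from the translations in this section.
UNITS = {"PV": "days", "VKO": "weeks", "KK": "months", "V": "years"}
AGE_RANGE = re.compile(r"^(\d+)(PV|VKO|KK|V)?-(?:(\d+)(PV|VKO|KK|V))?$")


def parse_age_range(token):
    """Parse tokens such as '0-6PV', '1KK-1V' or '16V-'.

    Returns ((low, unit), (high, unit) or None), or None if the token
    does not look like an age range.
    """
    match = AGE_RANGE.match(token.strip().upper())
    if match is None:
        return None
    low, low_unit, high, high_unit = match.groups()
    low_unit = low_unit or high_unit  # '2-4V' means 2 to 4 years
    if low_unit is None:
        return None
    lower = (int(low), UNITS[low_unit])
    upper = (int(high), UNITS[high_unit]) if high is not None else None
    return lower, upper
```

An open-ended token such as 16V- comes back with `None` as the upper bound, which can be read as "16 years and older".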

Medical terms:

  • ERYT: Erythrocytes (red blood cells)

  • EPIT.SOLUT: Epithelial cells

  • FOLLIKK.VAIHE: Follicular phase (of menstrual cycle)

  • MAKUU: Lying down (usually referring to blood pressure measurement)

  • MENARKEA: Menarche (first menstrual period)

  • PYSTY: Standing (usually referring to blood pressure measurement)
