Data

Data location

Sandbox

/finngen/library-red/finngen_R13/kanta_lab_2.0/data/finngen_R13_kanta_lab_2.0.txt.gz in textual TSV-gzipped format (for use with awk, grep, UNIX piping)
/finngen/library-red/finngen_R13/kanta_lab_2.0/data/finngen_R13_kanta_lab_2.0.parquet in binary Parquet format (for use with Python pandas, R data.frame)
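
For readers who prefer Python over awk/grep for the TSV file, the following is a minimal sketch of streaming the gzipped file row by row and keeping only one OMOP concept. The function name `iter_rows_for_concept` is hypothetical, and the sketch assumes the file has a header row containing the column names listed under "Data columns".

```python
import csv
import gzip


def iter_rows_for_concept(path, omop_concept_id):
    """Yield rows (as dicts) whose OMOP_CONCEPT_ID matches.

    The file is read as a stream, so it never has to fit in memory.
    Assumption: the first line of the file is a tab-separated header.
    """
    wanted = str(omop_concept_id)
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["OMOP_CONCEPT_ID"] == wanted:
                yield row


# Example usage inside the Sandbox (path taken from the list above):
# path = "/finngen/library-red/finngen_R13/kanta_lab_2.0/data/finngen_R13_kanta_lab_2.0.txt.gz"
# for row in iter_rows_for_concept(path, 3026361):
#     print(row["FINNGENID"], row["MEASUREMENT_VALUE_HARMONIZED"])
```

All values come back as strings, so numeric columns need an explicit `float(...)` conversion before use.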

BigQuery

Available in this table: finngen-production-library.sandbox_tools_r12.kanta_r12_v1

Data columns

N.B. The raw data contains a MEASUREMENT_FREE_TEXT column that cannot be released directly because it may contain sensitive data. It holds a mix of numerical measurement values, positive/negative outcomes, outcomes linked to thresholds (e.g. <3ml) and general notes. We therefore extract the shareable information from the original column by cleaning and whitelisting its content.

Overview

Kanta lab file

| Column | Description | ETL |
|---|---|---|
| ROW_ID | Identifying number of entry | ✓ |
| FINNGENID | Study ID (Pseudonymised ID given to the FinnGen participant) | ✓ |
| SEX | Sex of the individual, female or male | ✓ |
| EVENT_AGE | Age (in years) at time of event, e.g. 12.012 | ✓ |
| APPROX_EVENT_DATETIME | Date (randomized) and time (not randomized) of event, e.g. 2020-01-02T07:30 (see details) | ✓ |
| OMOP_CONCEPT_ID | OMOP Concept ID mapped from the TEST_ID and MEASUREMENT_UNIT | ✓ |
| TEST_NAME | Short name of the lab test, e.g. p-alat, s-tsh | ✓ |
| MEASUREMENT_VALUE_HARMONIZED | Value of the test measurement, after harmonization across the OMOP Concept ID | ✓ |
| MEASUREMENT_UNIT_HARMONIZED | Corresponding unit for the harmonized measurement value | ✓ |
| MEASUREMENT_VALUE_EXTRACTED | Value of the test measurement extracted from the MEASUREMENT_FREE_TEXT column | |
| MEASUREMENT_VALUE_MERGED | Harmonized and extracted values merged together | |
| TEST_OUTCOME | Label given for the outcome of the test to indicate how it falls against the reference range (see value table) | ✓ |
| TEST_OUTCOME_IMPUTED | Imputed test outcome (see value table) | ✓ |
| TEST_OUTCOME_TEXT_EXTRACTED | [<\|>]\|[VALUE]\|[UNIT?] extracted from the MEASUREMENT_FREE_TEXT column | |
| OUTCOME_POS_EXTRACTED | 1 (pos) or 0 (neg) outcome extracted from the MEASUREMENT_FREE_TEXT column | |
| TEST_ID_IS_NATIONAL | Whether or not the TEST_ID is using the national lab test code system | ✓ |
| MEASUREMENT_VALUE | Value of the test measurement | ✓ |
| MEASUREMENT_UNIT | Corresponding unit for the test measurement | ✓ |
| MEASUREMENT_STATUS | Code indicating the status of the lab test measurement (see value table) | ✓ |
| REFERENCE_RANGE_GROUP | Reference range for this event, as text | ✓ |
| REFERENCE_RANGE_LOW_VALUE | Value for the low bound of the reference range | ✓ |
| REFERENCE_RANGE_LOW_UNIT | Corresponding unit for the low bound of the reference range | ✓ |
| REFERENCE_RANGE_HIGH_VALUE | Value for the high bound of the reference range | ✓ |
| REFERENCE_RANGE_HIGH_UNIT | Corresponding unit for the high bound of the reference range | ✓ |
| CODING_SYSTEM_ORG | Derived from CODING_SYSTEM_OID | ✓ |
| CODING_SYSTEM_OID | Original name: tutkimuskoodistonjarjestelmaid | ✓ |
| TEST_ID_SOURCE | Code of the lab test, as it appeared before preprocessing of the data | ✓ |
| TEST_NAME_SOURCE | Short name of the lab test, as it appeared before preprocessing of the data | ✓ |
| MEASUREMENT_VALUE_SOURCE | Value of the test measurement, as it appeared before data cleaning | ✓ |
| MEASUREMENT_UNIT_SOURCE | Unit of the test measurement, as it appeared before data cleaning | ✓ |

Extended columns

In addition to the core file, there is another file containing metadata columns that are, for the most part, either empty or contain information we have not yet been able to decipher. It also contains some source data (e.g. test name, source value, source unit) that can be used to identify possible bugs in our pipeline. The two files can be merged via the ROW_ID column.
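
As a rough illustration of the merge, the sketch below joins rows from the two files on ROW_ID using plain dictionaries. The function name `merge_on_row_id` is hypothetical, rows are assumed to be dicts keyed by column name (for example as produced by `csv.DictReader`), and letting the core file win on shared columns is an assumption rather than a documented rule.

```python
def merge_on_row_id(core_rows, extended_rows, key="ROW_ID"):
    """Left-join extended rows onto core rows using the shared ROW_ID.

    Assumption: ROW_ID is unique within each file. Where both files carry
    the same column, the value from the core file is kept.
    """
    extended_by_id = {row[key]: row for row in extended_rows}
    merged = []
    for row in core_rows:
        extra = extended_by_id.get(row[key], {})
        # Later keys override earlier ones, so core values take precedence.
        merged.append({**extra, **row})
    return merged
```

For the full-size files, the same join is likely more practical on the Parquet versions with a data frame library, since this sketch holds the whole extended file in memory.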

| Column | Description | ETL |
|---|---|---|
| ROW_ID | Identifying number of entry | ✓ |
| FINNGENID | Study ID (Pseudonymised ID given to the FinnGen participant) | ✓ |
| SEX | Sex of the individual, female or male | ✓ |
| EVENT_AGE | Age (in years) at time of event, e.g. 12.012 | ✓ |
| APPROX_EVENT_DATETIME | Date (randomized) and time (not randomized) of event, e.g. 2020-01-02T07:30 (see details) | ✓ |
| OMOP_CONCEPT_ID | OMOP Concept ID mapped from the TEST_ID and MEASUREMENT_UNIT | ✓ |
| TEST_ID | Code of the lab test, as it appeared before preprocessing of the data | ✓ |
| TEST_ID_IS_NATIONAL | Whether or not the TEST_ID is using the national lab test code system | ✓ |
| TEST_NAME_SOURCE | Short name of the lab test, as it appeared before preprocessing of the data | ✓ |
| MEASUREMENT_VALUE_SOURCE | Value of the test measurement, as it appeared before data cleaning | ✓ |
| MEASUREMENT_UNIT_SOURCE | Unit of the test measurement, as it appeared before data cleaning | ✓ |
| MEASUREMENT_STATUS | Code indicating the status of the lab test measurement (see value table) | ✓ |
| REFERENCE_RANGE_GROUP | Reference range for this event, as text | ✓ |
| REFERENCE_RANGE_LOW_VALUE | Value for the low bound of the reference range | ✓ |
| REFERENCE_RANGE_LOW_UNIT | Corresponding unit for the low bound of the reference range | ✓ |
| REFERENCE_RANGE_HIGH_VALUE | Value for the high bound of the reference range | ✓ |
| REFERENCE_RANGE_HIGH_UNIT | Corresponding unit for the high bound of the reference range | ✓ |
| CODING_SYSTEM_ORG | Derived from CODING_SYSTEM_OID | ✓ |
| CODING_SYSTEM_OID | Original name: tutkimuskoodistonjarjestelmaid | ✓ |
| SERVICE_PROIVDER_OID | Probably the ID of the place where the lab test was taken/processed. Original name: antaja_organisaatioid | |

`TEST_OUTCOME`

This column provides a label comparing the measured value against a reference range.

| Value | Description |
|---|---|
| N | Normal |
| A | Abnormal |
| AA | Very abnormal |
| L | Low |
| LL | Very low |
| H | High |
| HH | Very high |

`TEST_OUTCOME_IMPUTED`

Some rows are missing the TEST_OUTCOME, so an imputed one is provided. TEST_OUTCOME_IMPUTED is derived from the data of the same OMOP Concept ID, provided that at least 100 entries have both MEASUREMENT_VALUE and TEST_OUTCOME. The thresholds are determined as follows. Entries with both a (harmonized) measurement and an outcome are sorted by value, with the outcome labels following the same order. E.g.

Value

OUTCOME

1.3

...

Starting from the lower end, we expect to find mostly low (L) values and then gradually find normal (N) ones. To find the turnover point where Ns become the majority, we define a relative measure: the number of L entries divided by the number of all entries. In ideal scenarios this value starts at 100% and gradually declines as more Ns (or other entries, like A and H) start to appear. When the relative measure drops under 95% for the last time, we define the threshold there. The same is done from the opposite side with H. The summary of the thresholds can be found in the repo. In the process we found two kinds of anomalies, mainly due to an asymmetric distribution of labels:

- +/- inf thresholds. In these cases not enough labels are present at the tails of the distribution. The algorithm starts with infinite thresholds, and they are never updated because the share of labels never climbs above the 95% threshold to begin with. This is usually associated with lab values where there is no such thing as L/H (e.g. triglycerides) or where the labels used are A instead of H|L.
- PROBLEM column. This boolean column flags the opposite issue: we traverse the whole list of values up to the median while still being above the 95% threshold, so the median value is used as the threshold. This indicates a heavy bias in the distribution of outcome labels, so one should proceed with caution. Values imputed with these thresholds are labelled with a *, e.g. L* or H*.
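
The threshold search described above can be sketched as follows. This is one possible reading of the description, not the code from the repo: the function names `low_threshold` and `high_threshold` are hypothetical, the relative measure is taken as the cumulative share of L (or H) labels among the entries seen so far, and the scan stops at the median entry.

```python
import math


def low_threshold(pairs, cutoff=0.95, label="L"):
    """Return (threshold, problem) for the low end of one OMOP concept.

    pairs: iterable of (value, outcome) with both fields present.
    The threshold starts at -inf and is moved to each value at which the
    cumulative share of `label` is still at or above `cutoff`, so it ends
    at the last such point before the share stays below the cutoff.
    problem is True when the share is still at or above the cutoff at the
    median, in which case the median value ends up as the threshold.
    """
    ordered = sorted(pairs, key=lambda pair: pair[0])
    if not ordered:
        return -math.inf, False
    median_idx = (len(ordered) - 1) // 2
    threshold = -math.inf
    hits = 0
    share = 0.0
    for i, (value, outcome) in enumerate(ordered[: median_idx + 1]):
        hits += outcome == label
        share = hits / (i + 1)
        if share >= cutoff:
            threshold = value
    return threshold, share >= cutoff


def high_threshold(pairs, cutoff=0.95):
    """Mirror of low_threshold: scan from the top of the range for H labels."""
    mirrored = [(-value, outcome) for value, outcome in pairs]
    threshold, problem = low_threshold(mirrored, cutoff, label="H")
    return -threshold, problem
```

A concept with no L labels at the low tail keeps the `-inf` threshold, and one whose lower half is almost entirely L returns the median with `problem=True`, matching the two anomalies listed above. Ties between equal values are not treated specially in this sketch.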

The content of the column is as follows:

| Value | Description |
|---|---|
| N | Imputed Normal |
| L | Imputed Low |
| L* | Imputed Low. Less confidence in the imputation due to over-representation of L and H from TEST_OUTCOME |
| H | Imputed High |
| H* | Imputed High. Less confidence in the imputation due to over-representation of L and H from TEST_OUTCOME |
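
To make the meaning of these labels concrete, here is a small sketch of how a value could be labelled once the low and high thresholds for its OMOP Concept ID are known. The function name `impute_outcome` and its parameters are hypothetical, and treating a value exactly on a threshold as L or H is an assumption, since the comparison rule is not stated here.

```python
import math


def impute_outcome(value, low, high, low_problem=False, high_problem=False):
    """Label a harmonized value against per-concept thresholds.

    low / high may be -inf / +inf, in which case L / H is never assigned.
    low_problem / high_problem correspond to the PROBLEM flag and add the
    '*' suffix that marks a less reliable imputation.
    Assumption: values equal to a threshold count as L or H.
    """
    if value <= low:
        return "L*" if low_problem else "L"
    if value >= high:
        return "H*" if high_problem else "H"
    return "N"


# Example: thresholds of 2.0 and 10.0, with the high side flagged as PROBLEM.
labels = [impute_outcome(v, 2.0, 10.0, high_problem=True) for v in (1.0, 5.0, 11.0)]
```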

`MEASUREMENT_STATUS`

| Value | Description |
|---|---|
| C | Corrected result |
| F | Final result |
| R | Unverified result |
| S | Partial result |

Pipeline

The pipeline is available on GitHub (https://github.com/FINNGEN/kanta_lab_preprocessing/), where technical information on how the raw data was processed can be found.

A quick summary:

- Duplicate entries are removed (based on ID, date, lab test name/code, measurement status and value).
- Text is processed to remove spaces and strange characters.
- National test codes are mapped to names based on known mappings.
- Units are cleaned/standardized and mapped to OMOP based on the lab test.
- Units are harmonized based on OMOP IDs.
- A second duplicate-removal step takes place after harmonization to catch duplicate entries from different systems (checking ID, date, harmonized test name, value and status).
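
The two de-duplication steps can be pictured with the sketch below. It is not the pipeline code: the function name `drop_duplicates` and the two key tuples are hypothetical, and the column names are a guess at how "ID, date, test name/code, status and value" map onto the released columns.

```python
# Assumed key columns for each pass (illustrative only).
FIRST_PASS_KEYS = (
    "FINNGENID", "APPROX_EVENT_DATETIME", "TEST_NAME", "TEST_ID",
    "MEASUREMENT_STATUS", "MEASUREMENT_VALUE",
)
SECOND_PASS_KEYS = (
    "FINNGENID", "APPROX_EVENT_DATETIME", "TEST_NAME",
    "MEASUREMENT_VALUE_HARMONIZED", "MEASUREMENT_STATUS",
)


def drop_duplicates(rows, key_columns):
    """Keep the first row for each combination of key column values."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(column) for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Running the second pass after harmonization matters because two systems can report the same measurement in different units, which only become comparable once the values are harmonized.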

Values extraction analysis

A key aspect of the v2 kanta data has been the extraction of information from the MEASUREMENT_FREE_TEXT column. Here we want to explain how this took place.

Summary

A quick summary:

- The MEASUREMENT_FREE_TEXT column is processed to extract shareable information:
  - Where the original measurement value is missing and free text is available, we attempt to extract numerical values from it if they match certain patterns. If, after some string manipulation, we are left with a pure number, it is cast from string to float and used to populate the MEASUREMENT_VALUE_EXTRACTED column.
  - The text is scanned for pos/neg substrings and, through a manual mapping, values are mapped to 1 (pos) or 0 (neg) in a new OUTCOME_POS_EXTRACTED column.
  - The text is scanned for entries that express the outcome as a comparison, structured as:
    - comparison (Yli/alle/</>)
    - numerical value
    - unit (potentially missing)

    These entries are standardized to the format [<|>]|[VALUE]|[UNIT?] (stored in TEST_OUTCOME_TEXT_EXTRACTED) so they can be shared safely.
- QC is then applied to remove extracted values that are formatted as dates.
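
The three kinds of extraction can be illustrated with the sketch below. The real patterns and the manual pos/neg mapping live in the pipeline repo; the regular expressions, the function names and the choice to leave the unit empty when it is missing are all assumptions made for this example, and the date QC step is not reproduced.

```python
import re

# Illustrative patterns only; the pipeline's own whitelist may differ.
NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")
COMPARISON = re.compile(r"(yli|alle|<|>)\s*(\d+(?:[.,]\d+)?)\s*(\S*)", re.IGNORECASE)
SIGNS = {"yli": ">", "alle": "<", "<": "<", ">": ">"}


def extract_number(text):
    """Return a float if the cleaned text is a pure number, else None."""
    cleaned = text.strip().replace(" ", "").replace(",", ".")
    return float(cleaned) if NUMBER.fullmatch(cleaned) else None


def extract_posneg(text):
    """Map pos/neg substrings to 1/0 (a stand-in for the manual mapping)."""
    lowered = text.lower()
    if "neg" in lowered:
        return 0
    if "pos" in lowered:
        return 1
    return None


def extract_comparison(text):
    """Standardize 'alle 3 ml'-style entries to SIGN|VALUE|UNIT, else None."""
    match = COMPARISON.fullmatch(text.strip())
    if match is None:
        return None
    sign, value, unit = match.groups()
    return f"{SIGNS[sign.lower()]}|{value.replace(',', '.')}|{unit}"
```

Requiring a full match in `extract_number` and `extract_comparison` is what keeps general notes out of the released columns: any entry containing additional free text is left unextracted.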

Extraction Summary

In the following table one can find a summary of the free text extraction process.

| OMOP | N_EXTRACTED | %_EXTRACTED | %_NA_MEASUREMENT | N_POSNEG | %_EXTRACTED | %_NA_OUTCOME | conceptName |
|---|---|---|---|---|---|---|---|
| OMOP ID | N of extracted numerical values | Percentage of numerical values extracted | Percentage of extracted values that had NA in raw data measurement | N of extracted POS/NEG values | Percentage of POS/NEG extracted values | Percentage of extracted values that had NA in raw data outcome | Concept Name |
| 3026361 | 2095662 | 22.6799 | 100.0000 | | 0.0000 | 100.0000 | Erythrocytes [#/volume] in Blood |
| 3018095 | 118284 | 22.3749 | 100.0000 | 67950 | 12.8536 | 6.2384 | Leukocytes [#/volume] in Urine |

extraction_summary_names.txt (141KB)

Other reference tables

Test name abbreviations

Test name abbreviations come from different laboratory testing centers around Finland. Some are standardized nationally and some are used only locally in different hospitals and test centers.

We have put a lot of effort into standardizing these to international OHDSI OMOP Concept ID (primarily from LOINC) so we hope that you do not need to interpret them very often! However, in case you have reason to use them, we provide the meaning of most abbreviations here.

Prefixes for lab test name abbreviations

Arterial blood

Puncture fluid

Alveolar gas

Amniotic fluid

Ascitic fluid

Blood

Bronchus fluid

Bile

Bronchoalveolar lavage

Bone Marrow

Bone

Breast

Bursa

Cannula/IV port

Capillary blood

Cervix fluid

Central nervous system

Collected urine

Chorionic villus

Dialysis fluid

Duodenal juice

Diurnal urine

Erythrocyte

Sputum

Fecal

Fasting blood

Vaginal fluor

Fasting plasma

Fasting serum

Gastrointestinal

Gastric juice

Hemoglobin

Heart

Kidney

Leukocytes

Lacrimal fluid

Likvor/CSF

Lymph Node

Liver

Lung

Lymphocytes

Muscle

Machine blood

Meconium

Mammary fluid

Maternal milk

Mucosa

Nerve

Nasal secretion

Nocturnal urine

Plasma

Peritoneal dialysis

Pleura

Pituitary gland

Placenta

Periodontal pocket

Pharyngeal secretion

Patient

Pus

Serum

Saliva

Secretion

Skin

Semen

Sweat

Syncytial fluid

Thrombocyte

Tissue

Tumor

Urine

Umbilical arterial blood

Urogenital

Umbilical serum

Umbilical venous blood

Venous blood

Water

Suffixes for lab test name abbreviations

| Suffix | Meaning |
|---|---|
| -Ab | Antibody |
| -AbA | IgA antibody |
| -AbE | IgE antibody |
| -AbG | IgG antibody |
| -AbM | IgM antibody |
| -Ag | Antigen |
| -Akt | Activity |
| -Aktt | Activation products |
| -Cl | Clearance |
| -Ct | Control |
| -D | DNA |
| -Di | Dialysis |
| -EVi | Special culture |
| -EM | Electron microscopic |
| -F | Fetal |
| -Fc | Flow cytometry |
| -Fr | Fraction |
| -Gr | Gestational |
| -IF | Immunofluorescence |
| -IH | Immunohistochemistry |
| -Ind | Index |
| -Ion | Ionized |
| -Is | Isoenzymes |
| -ISH | In situ hybridisation |
| -Jtk | Follow-up study |
| -Jvi | Follow-up culture (jatkoviljely) |
| -Kj | Conjugate |
| -Lm | Species specificity |
| -MS | Mass spectrometry |
| -Nh | Nucleic acid |
| -O | Qualitative |
| -Oc | Oligoclonal |
| -Pa | Long term |
| -Pse | Screening and categorization |
| -PT | Rapid test |
| -R | Exercise stress test |
| -S | Stimulation |
| -Sc | Subclasses |
| -Ty | Typing |
| -V | Free or unconjugated |
| -Vi | Microbiology culture (e.g. u-Baktvi = bacterial culture from urine, ps-stravi = Strep A culture in pharyngeal secretion, F-sienVi = fungal culture in stool) |
| -Vit | Vitamin |
| -Vr | Staining |
| -Vt | Point of care (vieritesti), often a rapid test |

Reference range terms

Test reference ranges are a free text string that can have a lot of Finnish in them. For those who don’t speak Finnish, we provide here translations of some of the common words you will see in reference ranges:

General terms:

AIKUISET: Adults
ALLE: Under/Below
ALK: Abbreviation for "alkaen", meaning "starting from" or "beginning at"
ALTISTUMATTOMAT: Unexposed (individuals)
AAMUNÄYTE: Morning sample
EDELLEEN: Still, continuing
FERTIILI-IKÄ: Fertile age
HOITOALUE: Treatment range
JA: And
JÄÄNNÖSPIT: Residual concentration
KAIKKI: All, everyone
KATSO: See, look at
KK: Abbreviation for "kuukausi", meaning month
KS: Abbreviation for "katso", meaning "see" or "look at"
KTS: Another abbreviation for "katso"
KYMENLAAKSONLAB: Kymenlaakso Laboratory (a specific lab in Finland)
LAPSET: Children
LEUK: Leukocytes (white blood cells)
LIER: Likely referring to "lieriöt", meaning casts (in urine analysis)
MIEHET: Men
NAISET: Women
NEGAT: Negative
NORMAALI: Normal
OHJEKIRJA: Manual, guidebook
PAASTO: Fasting
POJAT: Boys
POSTMENOPAUSSI: Postmenopausal
PREMENOPAUSAALISET: Premenopausal
PUBERT: Puberty
RASKAUS: Pregnancy
SUOSITELTAVA: Recommended
TAVOITE: Target, goal
TAVOITEARVO: Target value
TERAP: Therapeutic
TOKSINEN: Toxic
TULKINTA: Interpretation
TUPAKOIMATTOMAT: Non-smokers
TYTÖT: Girls
V: Abbreviation for "vuosi", meaning year
VASTASYNT: Newborn
VIITEARVO: Reference value
VKO: Abbreviation for "viikko", meaning week
VRK: Abbreviation for "vuorokausi", meaning day (24-hour period)
YLI: Over, above

Age-related terms:

0-6PV: 0-6 days
1KK-1V: 1 month to 1 year
1V-: 1 year and older
2-4V: 2-4 years
5-10V: 5-10 years
11-15V: 11-15 years
16V-: 16 years and older
18V-: 18 years and older

Medical terms:

ERYT: Erythrocytes (red blood cells)
EPIT.SOLUT: Epithelial cells
FOLLIKK.VAIHE: Follicular phase (of menstrual cycle)
MAKUU: Lying down (usually referring to blood pressure measurement)
MENARKEA: Menarche (first menstrual period)
PYSTY: Standing (usually referring to blood pressure measurement)
