FAQ

Explanations for common questions about the Kanta lab values dataset

Q: Which years are covered in the data?

The data covers the years 2014 to 2023. However, not all lab data providers sent their data from 2014 onward. Most of the data becomes available from 2018.

Q: What part of the FinnGen cohort also has data in the Kanta Lab dataset?

The Kanta Lab data was requested for all FinnGen participants. However, because the Kanta Lab data spans 2014 to 2023, only FinnGen participants who were alive during this period and have lab records in Kanta are in the dataset.

In total, there are 482k FinnGen participants with data in the Kanta Lab dataset and an average of 482 tests/individual.

Q: Why are some rows missing both the measurement value and the test outcome?

We decided to retain rows that are missing both the measurement value and the test outcome, as there might still be information in knowing the test was ordered. There are a few reasons we are aware of for why such rows occur:

The test abbreviation refers to a panel of tests, such as a blood count/CBC panel. In this situation, the abbreviation lets you know the whole panel was run. Still, specific measurements and outcomes such as (H)igh, (L)ow or (A)bnormal would only be available for specific tests on red blood cells, white blood cells, hemoglobin, etc.
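As a minimal sketch of how such rows could be set apart before an analysis: the example below assumes the rows have been loaded as dictionaries keyed by column name, and that missing entries appear as an empty string, the string "NA", or `None`. Both the loading approach and the missing-value markers are assumptions about the export format, and the example rows are made up.

```python
# Markers treated as "missing"; this set is an assumption about the export format.
MISSING = {"", "NA", None}

def is_order_only(row):
    """True when a row has neither a measurement value nor a test outcome."""
    return (row.get("MEASUREMENT_VALUE") in MISSING
            and row.get("TEST_OUTCOME") in MISSING)

# Made-up example rows, not real data.
rows = [
    {"FINNGENID": "FG0001", "MEASUREMENT_VALUE": "5.2", "TEST_OUTCOME": "N"},
    {"FINNGENID": "FG0001", "MEASUREMENT_VALUE": "", "TEST_OUTCOME": "NA"},
    {"FINNGENID": "FG0002", "MEASUREMENT_VALUE": "", "TEST_OUTCOME": "H"},
]

# Rows that only tell us a test (or panel) was ordered.
order_only = [r for r in rows if is_order_only(r)]
# Rows that carry a value, an outcome, or both.
with_result = [r for r in rows if not is_order_only(r)]
```

Keeping the two groups separate lets you count ordered panels without mixing them into value-based summaries.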

Q: How do we know the reference range for each test?

Reference range is a free text string that depends on:

age of the individual
sex of the individual
chemistry/detection method used by the particular lab
units of the test
date of the test (ranges may change over time or due to updated public health guidelines for tests such as cholesterol)

Unfortunately, only 1/3 of the events include a reference range specific to these combined values. Because so many tests lack this information we have created an imputed range for each OMOP ID. We have computed these based on events where the MEASUREMENT_VALUE, MEASUREMENT_UNIT and TEST_OUTCOME are present for the event. Please see below for more about this.

Q: How reliable are the time measurements from `APPROX_EVENT_DATETIME` in the data?

The most common times for a test are 7:00 a.m. and 7:01 a.m. This is likely the time at which the tests were ordered, while the tests themselves may have been carried out throughout the morning. Other test times should be a more reliable indicator of when the test was actually taken.

While the date component of the APPROX_EVENT_DATETIME column is randomized for privacy blurring (stable number of days per FINNGENID), the hour:minutes component comes from the raw data and is not randomized.
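If you want to treat the 7:00 and 7:01 timestamps with extra caution, one option is to flag them. This is a minimal sketch that assumes `APPROX_EVENT_DATETIME` is available as an ISO-style string such as "2019-03-04 07:00:00"; the string format and the helper name `likely_order_time` are assumptions, not part of the dataset documentation.

```python
from datetime import datetime, time

# The two most common clock times, which likely reflect ordering rather than sampling.
LIKELY_ORDER_TIMES = {time(7, 0), time(7, 1)}

def likely_order_time(approx_event_datetime):
    """True when the hour:minute component is 07:00 or 07:01.

    Assumes an ISO-style datetime string; seconds are ignored.
    """
    ts = datetime.fromisoformat(approx_event_datetime)
    return ts.time().replace(second=0, microsecond=0) in LIKELY_ORDER_TIMES

# Made-up timestamps for illustration.
flags = [likely_order_time(s) for s in (
    "2019-03-04 07:00:00",
    "2019-03-04 07:01:30",
    "2019-03-04 13:45:00",
)]
```

Only the hour:minute component is inspected, since the date component is shifted for privacy.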

Q: Can I differentiate between the time a test is ordered vs. the time a test is taken?

No. The original data comes from different sources (different lab centers, different IT systems, etc.) and has undergone several data processing stages before reaching FinnGen. Unfortunately, by the time the raw data reaches FinnGen, we do not know what the time information relates to, and it is most likely inconsistent depending on the original data source.

Q: What is the meaning of the different `TEST_OUTCOME` and `TEST_OUTCOME_IMPUTED` values?

TEST_OUTCOME values may be recorded as:

(N) for normal

(L) & (LL) for low or very low

(H) & (HH) for high and very high

(A) & (AA) for abnormal and very abnormal. Note that if a test has a normal range of 10 to 20, (A)bnormal can be marked for results <10 as well as >20.

(NA) when the outcome is missing. In our QC of the data, most measured values with an NA outcome are in the normal range; however, there is no guarantee of that.

TEST_OUTCOME_IMPUTED

Unfortunately, the raw data did not include defined reference ranges for all the lab tests. To approximate the ranges, we looked at tests that have the trio of measurement value, measurement unit and outcome, and approximated a low and a high value for each type of lab test. We then applied these "imputed" low and high values to score tests that did not have an outcome.
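The sketch below illustrates the general idea only and does not reproduce the actual imputation procedure. It assumes a deliberately simple rule: for each OMOP concept, the low and high bounds are the smallest and largest values that were recorded with outcome N, and all values for a concept are already in the same unit. The function names, the rule, and the concept ID 111 are hypothetical.

```python
from collections import defaultdict

def approximate_ranges(rows):
    """Per OMOP concept, take the min and max of values recorded as (N)ormal.

    Simplifying assumption for illustration, not the rule used for the dataset.
    """
    normal_values = defaultdict(list)
    for r in rows:
        if r["TEST_OUTCOME"] == "N" and r["MEASUREMENT_VALUE"] is not None:
            normal_values[r["OMOP_CONCEPT_ID"]].append(r["MEASUREMENT_VALUE"])
    return {cid: (min(v), max(v)) for cid, v in normal_values.items()}

def impute_outcome(row, ranges):
    """Score a row that lacks an outcome against the approximated range."""
    bounds = ranges.get(row["OMOP_CONCEPT_ID"])
    if bounds is None or row["MEASUREMENT_VALUE"] is None:
        return None  # nothing to compare against
    low, high = bounds
    if row["MEASUREMENT_VALUE"] < low:
        return "L"
    if row["MEASUREMENT_VALUE"] > high:
        return "H"
    return "N"

# Made-up rows for a hypothetical concept ID 111.
scored = [
    {"OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 120.0, "TEST_OUTCOME": "N"},
    {"OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 150.0, "TEST_OUTCOME": "N"},
    {"OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 110.0, "TEST_OUTCOME": "L"},
    {"OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 170.0, "TEST_OUTCOME": "H"},
]
ranges = approximate_ranges(scored)
unscored = {"OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 160.0, "TEST_OUTCOME": None}
imputed = impute_outcome(unscored, ranges)
```

A min/max rule is sensitive to mislabelled rows; a real procedure would likely need something more robust, so treat `TEST_OUTCOME_IMPUTED` in the dataset as the reference rather than this sketch.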

Q: How do I know if a test was positive or negative, say for Covid?

Positive and negative are not concepts that exist in the raw data. In most cases you would want to look for N (Normal) vs. A (abnormal), AA (very abnormal) or H (High), HH (Very high) as analogous to negative vs. positive.
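A minimal sketch of that mapping, assuming the outcome codes listed earlier in this FAQ. The helper name `positive_like` is hypothetical, and outcomes that are not covered by the analogy (L, LL, NA or missing) are returned as `None` rather than guessed.

```python
# Outcomes treated as analogous to "positive", following the text above.
POSITIVE_LIKE = {"A", "AA", "H", "HH"}

def positive_like(test_outcome):
    """Map a TEST_OUTCOME code to True (positive-like), False (negative-like) or None."""
    if test_outcome == "N":
        return False
    if test_outcome in POSITIVE_LIKE:
        return True
    return None  # L, LL, NA or missing: the analogy does not say

results = {code: positive_like(code) for code in ("N", "A", "HH", "L", "NA")}
```

Whether this analogy holds depends on the test, so it is worth checking a handful of rows for the specific test you are interested in.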

Q: What is "vierimittaus"?

Vierimittaus refers to point-of-care testing (POCT), also called near-patient testing or bedside testing. It describes the sampling, that is, whether the sample was taken at a laboratory, at the bedside, or by the patients themselves.

Q: When will the Kanta Lab dataset be updated?

The mapping will be updated for the DF13 release in February 2025. However, we do not expect to add any additional lab results until DF14 in February 2026.

Q: Should I use `MEASUREMENT_VALUE` or `MEASUREMENT_VALUE_HARMONIZED` for my analysis?

MEASUREMENT_VALUE_HARMONIZED is the best one to use. Different labs may deliver results for the same type of test in different units. When we map lab results from different labs, we harmonize the units to the OMOP standard and apply a conversion factor so that all values are expressed in the same units, which generates MEASUREMENT_VALUE_HARMONIZED. MEASUREMENT_VALUE is provided as a safeguard: if there is a problem in the mapping or the conversion factor, you still have access to the raw value.
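One way to use the raw value as a safeguard is to look at the conversion factor implied by the two columns. The sketch below assumes both values are already parsed as numbers; the helper name `implied_conversion_factor` and the example row are hypothetical.

```python
def implied_conversion_factor(row):
    """Return harmonized / raw, or None when the ratio cannot be computed."""
    raw = row.get("MEASUREMENT_VALUE")
    harmonized = row.get("MEASUREMENT_VALUE_HARMONIZED")
    if raw is None or harmonized is None or raw == 0:
        return None
    return harmonized / raw

# Made-up row: a raw value of 2.0 harmonized to 2000.0 implies a factor of 1000.
example = {"MEASUREMENT_VALUE": 2.0, "MEASUREMENT_VALUE_HARMONIZED": 2000.0}
factor = implied_conversion_factor(example)
```

Within one test and one source unit the implied factor should be constant, so rows where it differs are candidates for a closer look.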

Q: Is it possible to know which test center processed the sample and made the test measurement?

No. Unfortunately we do not know which test center performed the test in the data that FinnGen received, so this information is also not currently available for FinnGen users.

Q: Why are some rows duplicates by `FINNGENID`, `APPROX_EVENT_DATETIME`, `OMOP_CONCEPT_ID` and `MEASUREMENT_VALUE`?

We have preprocessed the Kanta lab data to make it as clean and usable as possible, but some oddities remain. We identified that ~0.7% of the rows in the Kanta lab dataset are duplicates by FINNGENID, APPROX_EVENT_DATETIME, OMOP_CONCEPT_ID and MEASUREMENT_VALUE.

We found that one reason for this duplication is that in the raw data the same record appears once with test ID referring to a local lab test code system, and then once more with its test ID referring to the national lab test code system.

We chose to keep these rows as it is not clear which of the duplicate rows to keep, and we have not yet done a systematic investigation on the origins of the row duplication.
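If your analysis is sensitive to these duplicates, you can separate them yourself. This is a minimal sketch over rows held as dictionaries; keeping the first occurrence is an arbitrary choice made here for illustration, since it is not clear which of the duplicate rows is the better one. The function name and the example rows are made up.

```python
KEY_COLUMNS = ("FINNGENID", "APPROX_EVENT_DATETIME", "OMOP_CONCEPT_ID", "MEASUREMENT_VALUE")

def split_duplicates(rows):
    """Return (first_seen, repeats) using the four key columns above."""
    seen = set()
    first_seen, repeats = [], []
    for row in rows:
        key = tuple(row.get(col) for col in KEY_COLUMNS)
        if key in seen:
            repeats.append(row)   # same key already encountered
        else:
            seen.add(key)
            first_seen.append(row)
    return first_seen, repeats

# Made-up rows: the first two share all four key columns.
rows = [
    {"FINNGENID": "FG0001", "APPROX_EVENT_DATETIME": "2019-03-04 07:00:00",
     "OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 5.2},
    {"FINNGENID": "FG0001", "APPROX_EVENT_DATETIME": "2019-03-04 07:00:00",
     "OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 5.2},
    {"FINNGENID": "FG0001", "APPROX_EVENT_DATETIME": "2019-03-05 09:30:00",
     "OMOP_CONCEPT_ID": 111, "MEASUREMENT_VALUE": 5.4},
]
first_seen, repeats = split_duplicates(rows)
duplicate_share = len(repeats) / len(rows)
```

Returning the repeats instead of discarding them makes it easy to inspect what was set aside.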

Q: Some of the units are unfamiliar to me, how do I know what they are?

One common notation you will see is e9. In blood cell counts, "e9/L" means "10^9 per liter" or "billion per liter". For example, a white blood cell count of 5.2 e9/L means 5.2 billion cells per liter. This unit is part of the International System of Units (SI) and is widely used in many countries for standardizing laboratory results. It is equivalent to the older unit "G/L" (giga per liter) or "10^9/L". Similarly, e12/L means "10^12 per liter" or "trillion per liter".

The full list of units can be found here: https://github.com/FINNGEN/kanta_lab_harmonisation_public/blob/main/MAPPING_TABLES/UNITSfi.usagi.csv
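The "e9/L" style of notation described above can be expanded programmatically. This is a minimal sketch that only handles units of the exact form "e<digits>/<denominator>"; the pattern and the function names are assumptions for illustration and do not cover the full unit list.

```python
import re

# Matches units such as "e9/L" or "e12/L"; anything else is left unparsed.
_EXPONENT_UNIT = re.compile(r"^e(\d+)/(\w+)$")

def scale_factor(unit):
    """Return (multiplier, denominator) for units like 'e9/L', else None."""
    match = _EXPONENT_UNIT.match(unit.strip())
    if match is None:
        return None
    return 10 ** int(match.group(1)), match.group(2)

def absolute_count(value, unit):
    """Convert e.g. 5.2 with unit 'e9/L' into a plain count per liter."""
    parsed = scale_factor(unit)
    if parsed is None:
        raise ValueError(f"unit not in exponent notation: {unit!r}")
    multiplier, denominator = parsed
    return value * multiplier, denominator

count, per = absolute_count(5.2, "e9/L")  # about 5.2 billion cells per liter
```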

Q: I have found some negative measurement values, are they a mistake?

Actually, there are a number of lab tests that are expected to have negative values! Checking for negative values is something that our OHDSI Achilles QC testing suite examines and any negative values you see in the data have been checked and should be realistic. Here are some tests that are expected to have negative values:

Calcium Balance: calcium balance studies can show negative values when there's more calcium excretion than intake.
Acid-Base Balance: Tests like Base Excess (BE) can be negative, indicating metabolic acidosis.
Iron Balance Studies: Like calcium, iron balance can be negative if more iron is lost than absorbed.
Anion Gap: While usually positive, it can be negative in rare cases like bromide intoxication.

Q: How was the Kanta data preprocessed before it got to FinnGen users?

The data is originally extracted from Kanta and sent to THL for pseudonymisation. THL then sends the pseudonymised data to FinnGen. At this stage, FinnGen runs a preprocessing step that includes row deduplication, column subsetting, data cleaning, harmonization with OMOP, and more. Extensive documentation about the preprocessing of the Kanta lab data by FinnGen can be found here: https://github.com/FINNGEN/kanta_lab_preprocessing/tree/master
