History of creating the FinnGen endpoints

Concept, definitions and format, register data processing and actual endpoint algorithms


Last updated 1 year ago


Dr. Aki Havulinna, MD Tuomo Kiiskinen, Dr. Susanna Lemmelä, Sami Koskelainen, Dr. Tero Hiekkalinna, Dr. Elisa Lahtela, Prof. Hannele Laivuori

The vision by Dr. Havulinna: create a comprehensive set of harmonized endpoints on diseases and health related conditions, covering the whole ICD-10. Provide tools to harmonize and preprocess the data and plan and implement the algorithm that creates the actual endpoints from the definitions and register data. This work has been and will be openly available to benefit the whole Finnish and international clinical and medical research community, not only the FinnGen project.

Prior to FinnGen

The root of the endpoint concept lies in the work of Dr. Havulinna and Prof. Veikko Salomaa since 2006. We needed some harmonized, multiregister-based cardiometabolic endpoints for our research work with the FINRISK data (N=30 000). The registers included were the register for healthcare (hospital discharges, special care outpatient visits, surgical operations and procedures), the causes-of-death register, the KELA registers of medicine purchases and drug reimbursements, and the cancer register. These registers cover almost half a century of data, during which, e.g., the Finnish-specific ICD versions 8, 9 and 10 have been used. Therefore, harmonization of the data and endpoint definitions was required.

The few cardiometabolic endpoints were soon expanded to a dozen. Havulinna created a set of SAS macros with which the endpoint rules for each endpoint were manually programmed in, and the macros would be applicable to different data sets (e.g., separate FINRISK survey years). Soon, even this approach was too tedious for continuously adding new endpoints.

Havulinna decided to create a systematic approach where endpoint definitions would be entered in a simple structured document and processed automatically by an R script to create the actual register-based endpoints. This was the beginning of the endpoint definition Excel file and the original FINRISK endpoint scripts. Around 2016 these concepts and scripts were used in the FIMM/THL/Pharma collaboration project, which was a pilot/prequel to the FinnGen project.

Some 200 endpoints were drafted by Prof. Hannele Laivuori and Prof. Markus Perola, based on the suggestions of participating pharma companies. Havulinna formalized the endpoint definitions in the Excel format, and together with Dr. Mervi Kinnunen and bioinformatician Elina Kilpeläinen we ran the R scripts to create the endpoints. Havulinna heavily improved and modified the scripts to gain speed and to cope with various data-related issues. The endpoint concept and scripts were now ready for a major new challenge.

FinnGen

In FinnGen the endpoint goals were set as follows (by Havulinna and Laivuori as leaders of the Clinical team):

1. The primary endpoints: Each pharmaceutical company was asked to provide a list of ~10 endpoints of main interest to them.

  • a. We divided the listed endpoints into disease categories (ICD-10 chapters)

  • b. For each disease category we established a clinical expert group of Finnish medical scholars and pharma representatives. The expert groups are listed at https://www.finngen.fi/en/clinical_expert_groups, but the structure of the groups has changed; in the original format, there were about a dozen group members besides the lead and secretary: experts representing all Finnish university hospitals, and pharma companies with an interest in the diseases in question. Besides the six original expert groups (Neurology, Gastroenterology, Rheumatology, Pulmonary diseases, Cardiometabolic diseases, Oncology), several new groups have emerged later. Original member lists can be seen in the list of collaborators in earlier FinnGen publications.

  • c. The expert groups helped in creating and fine-tuning the endpoints of interest, with varying amount of contribution by each group.

2. PheWAS approach: This constitutes the bulk of the endpoints. Given the large amount of data to be collected, and the overrepresentation of diseased individuals (FinnGen samples are based in large part on hospital biobank samples), we wanted to create as wide a range of disease endpoints as possible, e.g., for a hypothesis-free study of the genetic associations of diseases.

Next, doctoral researcher Tuomo Kiiskinen, MD, joined the clinical team. This was the beginning of the huge job of creating the FinnGen endpoint library for PheWAS. We proceeded by adding one ICD-10 chapter at a time, prioritizing the chapters most important to FinnGen. Kiiskinen's work on one chapter took 3-4 weeks, after which Havulinna made initial semi-automated checks to verify the consistency of the hierarchical structure and to catch obvious errors in diagnosis codes or elsewhere. The original approach (by Kiiskinen) for each chapter was as follows:

  1. Manually match each ICD-10 code with the Finnish ICD-9 and ICD-8 codes. A 1:1:1 match is usually impossible, so for every endpoint a decision was needed: either modify the ICD-10 structure, usually by combining codes, or, where the earlier versions were too outdated compared to ICD-10, drop the ICD-8/9 codes into the NAS category.

  2. For every endpoint this was NOT straightforward; besides medical knowledge it required studying the diseases (Terveysportti, Wikipedia, literature, etc.) to make the best decisions, so it really took a lot of time.

  3. See if there are any specific drug reimbursement codes that would cover these endpoints

  4. See if there are any disease specific drug purchase ATC-codes that would match these endpoints

  5. Check if clinical groups had any specific requests and either a) modify the already made endpoints or b) create these requests as additional “custom/design” endpoints (=composite endpoints)

  6. Submit the work to Havulinna for initial checks, potential corrections and for running the actual endpoints in the FINRISK data, to see that everything works and what the endpoint case distributions look like

  7. Present the work (Kiiskinen, Havulinna) to the primary clinical expert group and to others.

At this point we did not have the Finnish ICD-8 or ICD-9 in an electronic format, which would have helped a lot. We only had PDF copies of the original books, scanned and processed by Havulinna.

Our first release of the endpoint library (January 2018), for FinnGen DF1, contained 2057 endpoints covering the ICD-10 Chapters 1-14. The endpoint algorithm already had the “INCLUDE” and “CONDITION” rules, and the possibility to create sex-specific endpoints. For each endpoint we also provided the control exclusion/eligibility rules, which were determined by Havulinna, mainly algorithmically, to exclude closely resembling diseases from controls.

Kiiskinen and Havulinna, with support from Laivuori, provided a unique combination of expertise, without which the FinnGen endpoint library would not exist.

FinnGen, further developments

For each consecutive data freeze (DF) we refined existing endpoints where problems were found, and added new endpoints based on suggestions from the FinnGen community. After the first FinnGen DFs, we improved some endpoints based on GWAS experience, e.g., by excluding T1D from T2D cases. The overlap was detected because of a clear HLA signal in the T2D GWAS; because T1D is an autoimmune disease, such HLA associations are known to be specific to T1D. A similar issue was found with UC vs Crohn’s disease.

The Covid era from 2020 onwards led to several changes also for the FinnGen endpoints. Kiiskinen had left after DF5 to focus on pursuing his PhD. Havulinna did the DF6 endpoint definition update. He added the missing ICD-10 chapters, 15-22. Chapter 15 (Pregnancy, childbirth and the puerperium) was harmonized by Laivuori and Havulinna, while chapters 16-22 still remain unharmonized.

Also during DF6, Dr. Tero Hiekkalinna started programming the Endpointter, an alternative implementation of Havulinna’s endpoint algorithm written in Python. Endpointter was written from the perspective of the detailed longitudinal source data, whereas the original R scripts were written for the wide-format register data as received from the original register authorities. Havulinna has also prepared a new version of the R scripts, adapted for the detailed longitudinal register data. Starting in January 2021 (DF7), the endpoints have been created using the Endpointter scripts. Starting with DF6v3, bioinformatician Sami Koskelainen took over the endpoint data creation process, running the endpoint scripts.

Dr. Elisa Lahtela joined FinnGen and the Clinical team in April 2020, and in DF7 took over the endpoint definition updating tasks from Havulinna, who (along with Laivuori) switched mainly to an advisory role when FinnGen2 started (August 2020). Endpoint changes were for the most part frozen, and during DF7-8 we removed redundant endpoints from the endpoint set and decided on a core set of endpoints, to avoid running an expensive GWAS for closely correlated endpoints.

The endpoints roughly follow the ICD-10 tree-like structure (https://icd.who.int/browse10/2019/en#/), to the level of detail that still makes sense in the context of genetic analyses. For example, .8 is usually "other specified" and .9 is "unspecified", so they were always combined into Other / unspecified (=non aliter specificatus, NAS), because if the "other" is not specified further, it is effectively equal to unspecified.
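The rule of combining the .8 and .9 subcodes can be illustrated with a minimal Python sketch. This is not the FinnGen endpoint algorithm: the function name `collapse_nas` and the `<category>_NAS` label format are hypothetical, chosen only to show the idea, and the sketch assumes plain three-character ICD-10 categories with a single-digit subcode.

```python
import re


def collapse_nas(icd10_code: str) -> str:
    """Map a '.8' or '.9' subcode to one shared label for its category.

    The '<category>_NAS' label is a hypothetical naming convention used
    for illustration only; other codes are returned unchanged (uppercased).
    """
    code = icd10_code.strip().upper()
    # Letter + two digits, optional dot, then a single 8 or 9 subcode.
    match = re.fullmatch(r"([A-Z]\d{2})\.?([89])", code)
    if match:
        return f"{match.group(1)}_NAS"
    return code


# Both "other specified" and "unspecified" end up in the same group.
codes = ["J84.1", "J84.8", "J84.9", "I10"]
grouped = [collapse_nas(c) for c in codes]
```

Here `J84.8` and `J84.9` share one label while `J84.1` and `I10` are left as they are. A real implementation would also need to handle exceptions, since the text says the .8/.9 convention holds "usually", not always.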

ICD-9: http://urn.fi/URN:NBN:fi-fe201701261356

ICD-8: http://urn.fi/URN:NBN:fi-fe201710058910

For each DF Havulinna also improved his endpoint algorithm, written as R scripts, introduced some new concepts there, and usually also created the actual endpoints and preprocessed the register data. The Register team lead at that time, Dr. Kati Kristiansson, ran the endpoint scripts for some DFs, with Havulinna doing the debugging. Dr. Susanna Lemmelä joined the Clinical team and the Register team in February 2019, and she took over much of the register data processing and the actual endpoint data creation from DF3 onwards.

For several DFs the original FinnGen study permission was restricted so that we could only release the derived endpoints, based on several registers, and not any original register data, which was kept within the FinnGen register team at THL. Since DF2 we have released endpoint longitudinal data, containing all events and source register information, besides the original first-ever event release.

The original endpoint scripts used the original register data in the wide format, largely unchanged. A few things, such as EVENT_AGE, were added in the preprocessing. In 2019 the register permission became more liberal, so that the diagnosis codes could also be released to the FinnGen sandbox (a secure computing environment that allows researchers from all over the world to do research and analyses on the enriched data of the FinnGen project). We then discussed a new, harmonized longitudinal register data format which would contain only the essentials (e.g., source register, age at diagnosis, diagnosis codes) to allow easy browsing of the events even without endpoint scripts. Drs. Andrea Ganna and Juha Karjalainen participated in formulating the concept, and Lemmelä quickly prepared the necessary detailed longitudinal data scripts with Havulinna helping. Detailed longitudinal data has been released using these scripts since mid-2019 (DF4). It took some time before the longitudinal data format was adopted as the basis for endpoint creation, as it required a complete rewrite of the original endpoint scripts.

Since DF3 we have also committed to improved QC, in collaboration with the register and analysis teams. Lemmelä created quality control R scripts for the endpoint and endpoint longitudinal data and ran case/control correlations, Jaccard indices and clusters for the FinnGen endpoints. Since DF7, Havulinna has written a set of R scripts for automatically updating the endpoints from a structured Excel file containing the changes; for checking the consistency of the endpoints, including the hierarchy; and for creating an endpoint definition change log. The register team now follows and updates a rigorous QC procedure to ensure the quality of the register data, the endpoint definitions, and the endpoints created from the register data. Starting in DF9-10, Koskelainen, with help from Lemmelä, who is a product owner of the FinnGen Clinical endpoints, has taken the main responsibility for the whole FinnGen endpoint data creation process.
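As an illustration of the Jaccard indices mentioned in the QC work, the overlap between the case sets of two endpoints A and B is |A ∩ B| / |A ∪ B|. The sketch below is written in Python rather than R and is not the actual QC code; the function names, endpoint labels and participant IDs are all made up for the example.

```python
from itertools import combinations


def jaccard(cases_a, cases_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of case IDs."""
    a, b = set(cases_a), set(cases_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets have index 0.
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(cases_by_endpoint):
    """Jaccard index for every unordered pair of endpoints."""
    names = sorted(cases_by_endpoint)
    return {
        (e1, e2): jaccard(cases_by_endpoint[e1], cases_by_endpoint[e2])
        for e1, e2 in combinations(names, 2)
    }


# Hypothetical endpoints and made-up participant IDs.
cases = {
    "ENDPOINT_A": {"ID1", "ID2", "ID3"},
    "ENDPOINT_B": {"ID2", "ID3", "ID4"},
    "ENDPOINT_C": {"ID5"},
}
overlaps = pairwise_jaccard(cases)
```

A value near 1 means two endpoints capture almost the same cases, which is one way to flag closely correlated endpoints for review; a value of 0 means no shared cases.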

https://www.finngen.fi/en/clinical_expert_groups
https://icd.who.int/browse10/2019/en#/
http://urn.fi/URN:NBN:fi-fe201701261356
http://urn.fi/URN:NBN:fi-fe201710058910