# Sandbox download requests – rules and examples for minimum N

**Background:** For data privacy reasons, only aggregate-level data derived from at least 5 individuals may be exported from Sandbox. Every subgroup used or visible in the analysis results must therefore contain >=5 individuals for the results to be downloadable. No individual-level data points or IDs may be shown.

### Most common download request types

**1) Case/control analysis results** (for instance GWAS results run with SAIGE or REGENIE)

These can be exported if both the case group and the control group have >=5 individuals. Columns that show genotype counts for each variant among cases and controls can be kept even if a count is <5.

**2) Histograms**

Each bar should be based on data from >=5 individuals. If a bar carries other identifiers (such as a red area for females and a blue area for males), each of these should also correspond to a group of >=5 individuals.
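As an illustration of the rule above, the following minimal sketch counts individuals per bar (and optionally per colored subgroup within a bar) before plotting. The function name `small_bins`, the bin width, and the example data are all hypothetical; this is not an official Sandbox check.

```python
from collections import Counter

MIN_N = 5  # minimum group size required by the export rule


def small_bins(values, bin_width, groups=None):
    """Return the (bin_start, group) keys whose count is below MIN_N.

    values: numeric observations, one per individual.
    groups: optional per-individual labels (e.g. sex) used to color bars.
    """
    if groups is None:
        groups = ["all"] * len(values)
    counts = Counter(
        ((v // bin_width) * bin_width, g) for v, g in zip(values, groups)
    )
    return {key: n for key, n in counts.items() if n < MIN_N}


# Invented example data: 13 individuals with an age and a sex label.
ages = [21, 22, 23, 24, 25, 26, 31, 32, 33, 34, 35, 36, 41]
sex = ["F", "F", "F", "F", "F", "M", "F", "F", "F", "F", "F", "F", "M"]

print(small_bins(ages, bin_width=10))              # bars too small overall
print(small_bins(ages, bin_width=10, groups=sex))  # colored areas too small
```

Any key returned by the function marks a bar or colored area that would need to be merged with a neighbor or dropped before the plot is requested for download.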

**3) Curves**

Each curve should be drawn from >=5 individuals. In survival curves, for instance, the N will at some point fall below 5, but this is OK as long as the entire group from which the curve is drawn has enough individuals. However, the curve should not contain any other identifiers (such as colored areas) unless they also refer to >=5 individuals. Please note that curves should not contain vertical bars or any other markers that show individual events (a staircase-like curve is allowed, however, as illustrated in the example below).

*A slightly modified real-life example of an approved curve (endpoint name, event details, SNP, and genotype counts changed)*

<figure><img src="/files/gIYhEo5qywOO3PC9eVNk" alt=""><figcaption></figcaption></figure>

**4) Scatter plots**

Scatter plots are allowed if each dot is an average of at least 5 individuals. In some cases an alternative to a scatter plot, such as a density plot, may be a better choice.

Exception: PCA plots (where each data point reflects a single individual) can be downloaded if the total N in the plot is >=5.

*Slightly modified real-life example of an approved density plot for two variables*

<figure><img src="/files/HGLHOob9HecMvC0xh7Us" alt=""><figcaption></figcaption></figure>
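For the non-PCA case, one way to obtain dots that are each an average of at least 5 individuals is to sort the points by one variable and average consecutive groups. The sketch below is only one possible approach; the function name `binned_means` and the grouping strategy are assumptions, not a method prescribed by the rules.

```python
from statistics import mean

MIN_N = 5  # minimum number of individuals behind each plotted dot


def binned_means(points, min_n=MIN_N):
    """Sort (x, y) pairs by x and average them in consecutive groups.

    Every group holds at least min_n individuals; a leftover tail smaller
    than min_n is merged into the last full group rather than plotted alone.
    Returns a list of (mean_x, mean_y, group_size) tuples.
    """
    if len(points) < min_n:
        raise ValueError("fewer than min_n individuals: nothing can be plotted")
    ordered = sorted(points)
    groups = [ordered[i:i + min_n] for i in range(0, len(ordered), min_n)]
    if len(groups[-1]) < min_n:
        tail = groups.pop()
        groups[-1].extend(tail)
    return [
        (mean(x for x, _ in g), mean(y for _, y in g), len(g)) for g in groups
    ]


# Invented example: 12 individuals, so the last 2 join the second group.
points = [(i, 2 * i) for i in range(12)]
for dot in binned_means(points):
    print(dot)
```

The third element of each tuple is the group size, which can be reported alongside the plot as the minimum N.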

**5) Pie charts**

Each sector should be derived from the data of >=5 individuals. If a sector is further divided or carries other identifiers, each subpart should also be derived from >=5 individuals.

**6) Code**

Code can be exported as long as it contains only "pure" commands and no table or header views of the data analyzed with it. If summary statistics, counts, or similar values appear in the code or in comments, the minimum N should be reported. The code must not contain any FinnGen IDs, so remove them prior to export.
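Before requesting a code export, it can help to scan the script for leftover IDs. The sketch below shows the general idea only: the regular expression is a hypothetical placeholder and must be replaced with the actual ID format used in your data, and the example script and ID are invented.

```python
import re

# Hypothetical placeholder pattern, NOT the documented FinnGen ID format:
# replace it with the real ID format used in your data.
ID_PATTERN = re.compile(r"\bFG[0-9A-Z]{8}\b")


def find_id_lines(source_text, pattern=ID_PATTERN):
    """Return (line_number, line) pairs for lines that match the ID pattern."""
    return [
        (number, line)
        for number, line in enumerate(source_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Invented example script with a made-up ID left in a comment.
script = """\
library(data.table)
# debugging individual FG1234ABCD
d <- fread("pheno.tsv")
"""

for number, line in find_id_lines(script):
    print(f"line {number}: {line}")
```

A scan like this catches only literal matches of the pattern, so a manual read-through of the code and its comments is still needed.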

### **Some exceptions:**

**1) Allele/genotype frequencies and counts**

Such statistics are allowed to be exported even if the allele/genotype is present in <5 individuals.

The allele counts for SNPs within haplotypes should be derived from >=5 individuals. This requirement also applies to haplotype frequencies. Please also note that any extra information about a haplotype group (for instance case/control counts in a haplotype group) must also fulfill the N>=5 rule.

**2) TBI files**

Binary files are usually not allowed, since admins have no way to inspect them. TBI files are an exception: they can be exported if you have generated one for your summary statistics. Keep in mind, however, that a TBI file can also be generated outside Sandbox from the data you have downloaded.

**3) Basic descriptive statistics**

Min/max/median/quartile values, shown for instance in box plots, often point to single individuals. These can currently still be exported, but it is recommended to add some fluctuation to them, **especially** when other data presented in the study/manuscript creates a risk that someone could be identified by combining multiple pieces of results. Values can be shown either in a boxplot or as a table of exact values. Note, however, that the boxplot should not contain any additional dots that point to single individual values (such as outlier dots around the min and max values).

*A slightly modified real-life example of an approved boxplot (endpoint name and case/control counts changed)*

<figure><img src="/files/B3gD4R6VtJ8IwomvT5y0" alt=""><figcaption></figcaption></figure>

*Imaginary example of basic descriptive statistics in a table (would be approved):*

| **ENDPOINT\_EVENT\_AGE** |       |
| ------------------------ | ----- |
| **min**                  | 2.56  |
| **1st quartile**         | 10.44 |
| **median**               | 34.23 |
| **mean**                 | 33.01 |
| **3rd quartile**         | 44.99 |
| **max**                  | 100.2 |

### **Additional things to consider:**

**1) Identify your subgroups correctly**

For instance, you could be running a case-control analysis for different PRS bins. In that case it is not enough to consider the total number of cases and controls; the case and control groups within each bin should also be >=5.
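The PRS-bin situation can be checked by counting individuals per (bin, status) cell instead of per status only. The following is a minimal sketch with invented data; the function name `undersized_cells` and the bin labels are hypothetical.

```python
from collections import Counter

MIN_N = 5  # minimum group size required by the export rule


def undersized_cells(records, min_n=MIN_N):
    """Return the (prs_bin, status) cells with fewer than min_n individuals.

    records: iterable of (prs_bin, status) pairs, one per individual.
    Cells with no individuals at all never appear in the counts and are
    therefore not reported.
    """
    counts = Counter(records)
    return {cell: n for cell, n in counts.items() if n < min_n}


# Invented example: 43 cases and 117 controls overall, which looks fine,
# but the "high" PRS bin contains only 3 cases.
records = (
    [("low", "case")] * 40
    + [("low", "control")] * 60
    + [("high", "case")] * 3
    + [("high", "control")] * 57
)

print(undersized_cells(records))
```

Here the totals pass the N>=5 rule while one subgroup does not, which is exactly the case the rule asks you to look for.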

**2) Limited results from <5 groups**

Results based on data from at least 5 individuals are generally considered anonymous and can be downloaded. We strongly recommend that all results should have N >= 5.

However, it is possible to export limited information about small groups (N < 5).\
If you need to do this, please note the following extra responsibilities:

* Evaluate whether keeping small groups (N < 5) adds scientific value to your study.
* If there are results from small groups (N < 5), the exact sample counts and other related statistics (such as p-value, percentage, and standard deviation) must be concealed by marking them "< 5".
* You must make sure (before even placing a download request) that the concealed data from the small group (N < 5) cannot be deduced from other data in the table (such as the total sum of N) or from other parts of the manuscript.
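The concealment and the deduction check described in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `mask_small_counts` is hypothetical, and the check only catches the simplest deduction route (a single masked count next to a visible total). It does not replace a manual review of the table and the manuscript.

```python
MIN_N = 5
MASK = "< 5"  # marker used to conceal small counts


def mask_small_counts(counts):
    """Return (masked_counts, total_is_safe).

    counts: dict mapping group name to its number of individuals.
    total_is_safe is False when exactly one group is masked, because the
    hidden count could then be recovered as total minus the visible counts.
    """
    masked = {g: (n if n >= MIN_N else MASK) for g, n in counts.items()}
    n_masked = sum(1 for v in masked.values() if v == MASK)
    return masked, n_masked != 1


# Invented example table with one small group.
table = {"group_A": 120, "group_B": 3, "group_C": 48}
masked, total_is_safe = mask_small_counts(table)
print(masked)
print("total can be shown:", total_is_safe)
```

When `total_is_safe` is False, showing the total N next to the masked table would reveal the concealed count, so the total (or another cell) would also need to be concealed. Statistics derived from the small group, such as percentages, need the same treatment.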

**3) Exception to the N>=5 rule**

The N>=5 rule is the primary guideline, and we strongly recommend using it to ensure the anonymity of your results. In exceptional cases, the N>=3 rule may be applied, but only if it is essential for the analysis and the scientific validity of the results.

Please note that applying the N>=3 rule does not override the requirement that only anonymized data may be downloaded from the Sandbox.

If you apply the N>=3 rule to your results, you must also provide an explanation of the measures taken to ensure anonymity. This ensures that the anonymity of the data is thoroughly assessed and properly documented. This explanation must be included as part of your download request.

**4) File formats**

Admins are able to inspect files that open in the terminal or with the most common graphical programs, such as Excel and Word. Files in other formats, such as binary files, will be rejected because we are not able to inspect them.

**5) What files do you actually need**

Please restrict your download request to the files that you really need. They all have to go through manual inspection and therefore keeping the files to an absolute minimum will help admins and you will receive the files faster.

**6) Timing**

Kindly note that we provide support for download requests from approximately 08:00 to 16:00 Finnish time. There is no support on weekends or public holidays. It usually takes a few working days to inspect your files, so the results of last-minute requests will likely not reach you in time.

**7) Keep your zip and tar files clean**

If there are hidden files or folders or locked files in your request, it will be automatically rejected.
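A quick way to spot hidden entries before submitting an archive is to list its members and flag any path component that starts with a dot. The sketch below uses only the Python standard library; the helper names are hypothetical, and the check covers hidden files and folders only, not locked files, and does not reproduce the actual automatic rejection logic.

```python
import io
import tarfile
import zipfile
from pathlib import PurePosixPath


def hidden_members(names):
    """Return archive member names where any path component starts with a dot."""
    return [
        name for name in names
        if any(part.startswith(".") for part in PurePosixPath(name).parts)
    ]


def hidden_in_zip(path_or_file):
    """List hidden entries in a zip archive (path or file-like object)."""
    with zipfile.ZipFile(path_or_file) as archive:
        return hidden_members(archive.namelist())


def hidden_in_tar(path):
    """List hidden entries in a tar archive given its path."""
    with tarfile.open(path) as archive:
        return hidden_members(archive.getnames())


# Invented example: build a small zip in memory with one hidden file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("results/summary.tsv", "n\n12\n")
    archive.writestr("results/.Rhistory", "ls()\n")
buffer.seek(0)

print(hidden_in_zip(buffer))
```

An empty list means no dot-prefixed files or folders were found; anything listed should be removed and the archive rebuilt before the request is placed.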


