Understanding the Cancer Registration Dataset

Last modified: 27 Feb 2026

The Cancer Registration Dataset is a record of all registered cancer diagnoses in England.

1. Introduction

The Cancer Registration Dataset was introduced in 1971. It records information on all cancer diagnoses, including date of diagnosis and the type and behaviour of each cancer.

2. Strengths of the cancer registration dataset

  1. It provides complete coverage of all cancer diagnoses in England

  2. It uses standardised coding systems, enabling comparisons over time and internationally

  3. The dataset includes diagnostic information such as cancer type and stage, and date of diagnosis

  4. Cancer registration data are validated and checked for completeness, meaning the dataset is a reliable representation of all cancer diagnoses in a given period

3. Limitations of the cancer registration dataset

  1. It is a clinical dataset, rather than one specifically designed for research

  2. There is no information on treatments or test results

  3. The thorough validation and completeness checks mean there can be a delay of up to 24 months before a diagnosis appears in the registry data

  4. Information on disease recurrence or progression is not included

4. Scope and coverage

The dataset contains patient-level information about:

  • tumour details (site, behaviour, staging)

  • date of diagnosis

  • patients’ sex, age and geographic location

5. Data collection methodology

Data are collected and collated from a range of clinical sources. These include: Hospital Episode Statistics (HES), lab reports, screening programmes, primary care prescriptions and death certificates.

6. Structure of the dataset

The dataset is diagnosis-level with each record (line) representing a single diagnosis. For individuals diagnosed with more than one type of cancer, each diagnosis is on a separate line.
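To illustrate the diagnosis-level structure, here is a minimal Python sketch that counts diagnosis records per individual. The `person_id` field name and the example rows are hypothetical and used only for illustration; the actual identifier variable should be checked against the UK LLC documentation.

```python
from collections import Counter

# Hypothetical rows: one dictionary per diagnosis (one line in the dataset).
# `person_id` is an assumed identifier name, not a documented variable.
records = [
    {"person_id": "A", "cancer_site": "C504"},
    {"person_id": "A", "cancer_site": "C189"},
    {"person_id": "B", "cancer_site": "C61"},
]


def diagnoses_per_person(rows):
    """Count how many diagnosis records each individual has."""
    return Counter(row["person_id"] for row in rows)


counts = diagnoses_per_person(records)

# Individuals with more than one diagnosis appear on more than one line.
multiple = sorted(pid for pid, n in counts.items() if n > 1)
print(counts, multiple)
```

Because the unit of analysis is the diagnosis, person-level analyses generally need a step like this to decide which diagnosis (for example, the first) to keep for each individual.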

7. Coding systems used

The dataset uses two main medical coding systems: ICD-10 for diagnoses and ICD-O-3 for morphology and behaviour.

ICD-10 (International Statistical Classification of Diseases and Related Health Problems) contains 22 hierarchical chapters, based on body systems. All cancers are recorded in Chapter II ‘Neoplasms’.
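As a rough illustration, the sketch below checks whether an ICD-10 code falls in Chapter II, which spans codes C00 to D48 in the WHO classification. The function name `is_chapter_ii` is hypothetical. The code accepts both undotted (C504) and dotted (C50.4) forms, and it applies to ICD-10 only, so it would not be appropriate for older records coded in ICD-9.

```python
def is_chapter_ii(code: str) -> bool:
    """Return True if an ICD-10 code lies in Chapter II (C00-D48)."""
    cleaned = code.strip().upper().replace(".", "")
    if len(cleaned) < 3 or not cleaned[1:3].isdecimal():
        return False  # not a recognisable ICD-10 category
    letter, number = cleaned[0], int(cleaned[1:3])
    if letter == "C":
        return True  # C00-C97: malignant neoplasms
    if letter == "D":
        return number <= 48  # D00-D48: in situ, benign, uncertain behaviour
    return False


# D50 onwards belongs to Chapter III, so it is excluded.
for example in ["C504", "D05.1", "D50", "I10"]:
    print(example, is_chapter_ii(example))
```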

ICD-O-3 (International Classification of Diseases for Oncology) is used to code the morphology, behaviour and grading of neoplasms.

These two coding systems were introduced to the cancer registry in 1995, superseding ICD-9 (previously ICD-8) and ICD-O-2 (previously ICD-O-1).

8. Evolution of the dataset

The cancer registry was established in 1971. It was initially based on snapshots of incidence counts by age, sex, cancer site and type, drawn from paper records. In the 2010s it became event-based and captured every diagnosis, drawing from multiple sources of health data.

9. Availability in the UK LLC TRE

The UK LLC TRE holds an extract of the cancer registration dataset going back to 1971, when the registry was established. The extract includes the cancer registration records of participants in UK LLC’s partner LPS, where individual or LPS permissions allow linkage to NHS data. UK LLC does not hold any information about people who are not part of a partner LPS or about LPS participants who have requested that their NHSE data not be shared via UK LLC.

More detailed information about the UK LLC’s cancer registration extract is here.

10. Missing information

  • Variable and value labels
    UK LLC is infilling missing variable and value labels in the NHSE datasets in the TRE. Where variable labels have been added by UK LLC, rather than being found in NHSE documentation, this is made apparent by the phrase ‘label added by UK LLC’ being included in the variable label.

  • Missing data
    The amount of missing data varies widely between variables and across datasets. Throughout 2026, we will update this section with information about missingness in the cancer dataset.

11. Tips for researchers using the cancer registration dataset in the UK LLC TRE

When applying to access linked cancer registration data in the UK LLC TRE, researchers must submit a codelist specifying the ICD-10 codes which are relevant to their research question.
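A codelist can be applied to the data in several ways. The sketch below shows one option, prefix matching, so that a 3-character entry such as C50 also selects 4-character codes such as C504. The helper names and the example codelist are hypothetical, and the codelist format required in an application should be confirmed with UK LLC.

```python
def normalise(code: str) -> str:
    """Upper-case a code and drop any decimal point (C50.4 -> C504)."""
    return code.strip().upper().replace(".", "")


def matches_codelist(cancer_site: str, codelist) -> bool:
    """True if the recorded code starts with any entry in the codelist."""
    prefixes = tuple(normalise(entry) for entry in codelist)
    return normalise(cancer_site).startswith(prefixes)


# Hypothetical codelist: all breast subsites (C50) and one colon subsite (C18.7).
codelist = ["C50", "C18.7"]
sites = ["C504", "C187", "C189", "C61"]

selected = [site for site in sites if matches_codelist(site, codelist)]
print(selected)
```

Normalising both sides before comparing avoids silent mismatches when a codelist is written with decimal points but the data are stored without them.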

In the dataset, ICD-10 codes are provided with up to 4 characters (e.g. C504). ICD-O-3 codes are split into two fields: cancer type and cancer behaviour (see Table below).

Key variables in the cancer registration dataset

| Variable name | Variable label | Description |
| --- | --- | --- |
| cancer_site | Site of cancer | ICD-9 or ICD-10 code |
| cancer_type | Cell type (histology) | First 4 digits of an ICD-O morphology code |
| cancer_behaviour | Cancer behaviour | Final (5th) digit of an ICD-O morphology code |
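The two ICD-O fields can be recombined into the conventional morphology notation, in which the behaviour digit follows a slash (e.g. 8500/3). The sketch below assumes both fields hold digits only (as strings or integers); the function name and lookup table are illustrative, with labels based on the standard ICD-O-3 behaviour codes.

```python
# Standard ICD-O-3 behaviour codes (5th digit of the morphology code).
BEHAVIOUR_LABELS = {
    "0": "benign",
    "1": "uncertain whether benign or malignant",
    "2": "in situ",
    "3": "malignant, primary site",
    "6": "malignant, metastatic site",
    "9": "malignant, uncertain whether primary or metastatic",
}


def morphology_code(cancer_type, cancer_behaviour) -> str:
    """Join cancer_type and cancer_behaviour into 'TTTT/B' notation."""
    type_part = str(cancer_type).strip()
    behaviour_part = str(cancer_behaviour).strip()
    if len(type_part) != 4 or not type_part.isdecimal():
        raise ValueError(f"expected a 4-digit cancer_type, got {cancer_type!r}")
    if len(behaviour_part) != 1 or not behaviour_part.isdecimal():
        raise ValueError(f"expected a 1-digit cancer_behaviour, got {cancer_behaviour!r}")
    return f"{type_part}/{behaviour_part}"


code = morphology_code("8500", "3")
label = BEHAVIOUR_LABELS.get(code[-1], "unrecognised behaviour code")
print(code, label)
```

Raising an error on malformed input, rather than padding or truncating, makes unexpected values in either field visible early.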

12. Useful syntax

Below we will include syntax that may be helpful to other researchers in the UK LLC TRE. For longer scripts, we will include a snippet of the code plus a link to the UK LLC GitHub repository where you can find the full scripts.

13. Further reading

Henson KE, Elliss-Brookes L, Coupland VH, Payne E, Vernon S, Rous B, Rashbass J. Data Resource Profile: National Cancer Registration Dataset in England. International Journal of Epidemiology, February 2020. https://doi.org/10.1093/ije/dyz076

UK Biobank. The use of International Classification of Diseases for Oncology (3rd edition) in UK Biobank. Version 1.1, February 2023. Available here.