Test data sets

From ACGT Competition

Overview: Two clinico-genomics datasets are made available to participants in the ACGT competition.

  • "MCMP" (Multi-Centric Multi-Platform) This dataset is a subset of a larger one developed in ACGT, in which breast-cancer gene-expression data from two microarray platforms are combined. The data obtained with one of the platforms have been published.
  • "SIOP" (Société Internationale d'Oncologie Pédiatrique) This dataset consists of data collected in a clinical study on kidney cancer in children (pediatric nephroblastoma). Gene-expression data were collected jointly with demographics data.

These data sets are described further below.

MCMP

This dataset contains breast-cancer gene-expression and demographics data. Gene expression data are provided in the format of two (single-color) microarray technologies: Affymetrix and Illumina. Illumina data are simulated, but based on the real clinical data obtained with the Affymetrix technology.

The Affymetrix data have been analyzed and the results published in the literature (see the PDF file listed below).

Files in the data set are:

  • JCO tam 2007.pdf : A copy of the original research article.
  • demo_public.csv : Demographics data for the patients in the study (*)
  • CELfiles.zip : Affymetrix gene-expression data (*)
  • HG-U133A.annot.csv : Annotation file for the HG-U133A microarray platform (*)
  • HG-U133B.annot.csv : Annotation file for the HG-U133B microarray platform (*)
  • simulatedillumina.zip : Gene-expression data in Illumina format
  • Human_RefSeq-8.csv : Annotation file for the Illumina platform
  • rs2gid2hugo.csv : A mapping between RefSeq sequence identifiers, Entrez Gene identifiers and Gene symbols.

Files marked with an asterisk in the list above contain all the information needed to reproduce the research results published in the article. Illumina data are provided for competitors interested in developing multi-platform methodologies.
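As a starting point for multi-platform work, the sketch below shows one way a RefSeq-to-gene mapping such as rs2gid2hugo.csv could be used to bring sequence-level values to a common gene-level key. The column names (refseq, entrez_gene_id, symbol), the inline rows, the expression values and both helper functions are assumptions made for illustration; the actual layout of the file may differ and should be checked before use.

```python
import csv
import io

# Illustrative rows only; the real header and layout of rs2gid2hugo.csv are assumed.
MAPPING_CSV = """refseq,entrez_gene_id,symbol
NM_000125,2099,ESR1
NM_004448,2064,ERBB2
NM_001005862,2064,ERBB2
"""


def load_refseq_map(handle):
    """Return {RefSeq id: (Entrez Gene id, gene symbol)} from a CSV with a header row."""
    return {
        row["refseq"]: (row["entrez_gene_id"], row["symbol"])
        for row in csv.DictReader(handle)
    }


def to_gene_level(values_by_refseq, refseq_map):
    """Re-key values by Entrez Gene id, averaging sequences that map to the same gene."""
    grouped = {}
    for refseq, value in values_by_refseq.items():
        if refseq not in refseq_map:
            continue  # unmapped sequences are simply skipped in this sketch
        entrez, _symbol = refseq_map[refseq]
        grouped.setdefault(entrez, []).append(value)
    return {entrez: sum(vals) / len(vals) for entrez, vals in grouped.items()}


refseq_map = load_refseq_map(io.StringIO(MAPPING_CSV))

# Made-up expression values keyed by RefSeq identifier.
illumina_values = {
    "NM_000125": 8.0,
    "NM_004448": 10.0,
    "NM_001005862": 12.0,
    "NM_999999": 5.0,  # not present in the mapping
}
gene_level = to_gene_level(illumina_values, refseq_map)
print(gene_level)
```

Averaging sequences that map to the same gene is only one possible choice; keeping the most variable sequence, or treating transcripts separately, may suit some analyses better.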

Demographics data can also be retrieved from a table in a MySQL database, using the Data Access Services provided in ACGT. Please request connection details from the competition management, specifying that you wish to access the database "acgt_transbig" on the server "Iapetus".


SIOP

The SIOP data set contains pediatric-nephroblastoma clinical and gene-expression data obtained from 77 patients. Gene-expression data were obtained using custom-made two-color microarrays (generally two hybridizations per patient). The data provided here come from a real clinical study whose results have been published. Raw microarray data have been made available on the ArrayExpress platform.

Files in the data set are:

  • PMID16287080.pdf : Research article in PDF format
  • A-MEXP-111.ft_annot.csv : Original microarray annotation file (Annotation: 2006)
  • A-MEXT-111.gal : Description of microarray layout in GAL format
  • annot.dat.20091208 : Microarray-sequence annotation (Annotation: December 2009)
  • clinical.csv : Research-paper clinical data
  • kegg_annot.txt : Annotation of KEGG pathways
  • kegg_paths.txt : KEGG pathways genes
  • mapping_molid.csv : Mapping between patient identifiers (ws*) and microarray identifiers (E-MEXP-221-*)
  • microarrays.zip : A ZIP archive with all microarray files
  • wilms_retino_pathways.csv : Genes found to be associated with tumor mechanisms in the original research paper.

In the raw data, there are multiple arrays per patient, and each gene can be represented by multiple array features. For competitors interested in mining the data without dealing with the technical details of microarray technology, two gene-expression matrices are provided in which this complexity has been reduced. These matrices are stored in the files A2.csv and M2.csv. A2 contains the average log2 signal for each feature (A-values in two-color microarray terminology); M2 contains the log2 fold-change (M-values). Both files have self-explanatory column and row headers: columns represent patients and rows represent genes. The signal was averaged per patient, and genes were selected by applying a variance filter among the features associated with the same gene (the feature with the most variable signal was kept and the others were discarded).
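The reduction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build A2.csv and M2.csv: the function names, the toy intensities and the identifiers are hypothetical, the red/green orientation of the fold-change is assumed, and the variance filter is applied here to M-values, whereas the original text does not say which signal was used.

```python
import math
from statistics import mean, pvariance


def ma_values(red, green):
    """Return (M, A) for one two-color feature from its two channel intensities."""
    m = math.log2(red) - math.log2(green)          # M: log2 fold-change
    a = 0.5 * (math.log2(red) + math.log2(green))  # A: average log2 signal
    return m, a


def patient_average(values_by_array, array_to_patient):
    """Average per-array values into a single value per patient."""
    grouped = {}
    for array_id, value in values_by_array.items():
        grouped.setdefault(array_to_patient[array_id], []).append(value)
    return {patient: mean(vals) for patient, vals in grouped.items()}


def most_variable_feature(features_by_gene, matrix):
    """For each gene, keep the feature whose values vary most across patients."""
    return {
        gene: max(features, key=lambda f: pvariance(matrix[f].values()))
        for gene, features in features_by_gene.items()
    }


# Hypothetical toy data: two arrays per patient, two features for one gene.
array_to_patient = {"arr1": "ws1", "arr2": "ws1", "arr3": "ws2", "arr4": "ws2"}
raw = {  # feature -> array -> (red, green) intensities
    "feat_a": {"arr1": (800, 200), "arr2": (400, 100),
               "arr3": (100, 400), "arr4": (200, 800)},
    "feat_b": {"arr1": (300, 300), "arr2": (500, 500),
               "arr3": (600, 300), "arr4": (200, 100)},
}
features_by_gene = {"GENE1": ["feat_a", "feat_b"]}

# Build a feature-by-patient matrix of patient-averaged M-values.
m_matrix = {}
for feature, arrays in raw.items():
    m_by_array = {arr: ma_values(r, g)[0] for arr, (r, g) in arrays.items()}
    m_matrix[feature] = patient_average(m_by_array, array_to_patient)

kept = most_variable_feature(features_by_gene, m_matrix)
print(kept)
```

In this toy example feat_a swings between patients while feat_b barely changes, so feat_a is the feature retained for GENE1. The same selection could equally be run on A-values if that better matches the analysis at hand.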

In a more complex setup, clinical data can be retrieved through queries to the ACGT ObTiMA database. Please request contact details from the competition management.

Microarrays can be retrieved through the ACGT BASE Data Access Services, using "wilms_assays" as service name.

Challenges in the competition

Examples of ideas to be developed in the context of the competition:

  • Develop a toolbox of statistical procedures working simultaneously on single-color (MCMP) and two-color (SIOP) microarrays.
  • Use the literature mining tool available in ACGT to provide extra information on the genes found significant in the example data sets.
  • Develop generic reporting methods to convey results obtained with GridR to clinicians.
  • ... Find your own challenge... Be bold!