RNA-Seq data repositories

Public repositories store data (both “raw” and processed) produced by the community from a variety of experiments: microarrays, high-throughput sequencing, high-throughput PCR, etc.

Most journals now require that the data be made publicly available when a study is published.

The major repositories for gene expression data:

These repositories are linked to the repositories of NGS raw data (Fastq files):

  • SRA (Sequence Read Archive)
  • ENA (European Nucleotide Archive)
  • DDBJ-DRA


EXERCISE

Let’s explore this GEO record (GSE76647)

  • Which platform and protocol were used for sequencing?
  • What type of RNA was sequenced? From which organism?
  • How many samples were sequenced?


Downloading data from a public repository

The fastq-dump program from the SRA Toolkit retrieves raw data from SRA on the command line:

fastq-dump --gzip --origfmt --split-files --skip-technical SRR-IDENTIFIER

# For us, it would be:
$RUN fastq-dump --gzip --origfmt --split-files --skip-technical SRR-IDENTIFIER

The options used here are:

  • --split-files: for paired-end data, write each read of a pair to its own file (if omitted, fastq-dump concatenates the reads of each spot into a single record in one file)
  • --origfmt: keep the original read names instead of the generic “SRA” naming
  • --gzip: write gzip-compressed FASTQ files (FASTQ files can take up a lot of storage!)
  • --skip-technical: output only biological reads (no barcodes, linkers, etc.)
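
An experiment usually has several runs, so the command is often wrapped in a loop. The following is a minimal bash sketch, not part of the original material: the function name download_runs and the DRY_RUN switch are hypothetical, and $RUN is assumed to be the optional command prefix used in this course (it is treated as empty when unset).

```shell
# Sketch: download several SRA runs with the fastq-dump options shown above.
# download_runs and DRY_RUN are hypothetical names chosen for illustration.
download_runs() {
    local acc
    local -a cmd
    # Each argument is one run accession (SRR...).
    for acc in "$@"; do
        cmd=(fastq-dump --gzip --origfmt --split-files --skip-technical "$acc")
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the fastq-dump command instead of running it.
            echo "${cmd[*]}"
        else
            # $RUN is left unquoted on purpose so that a multi-word prefix
            # is split into separate words; it expands to nothing when unset.
            ${RUN:-} "${cmd[@]}"
        fi
    done
}
```

Calling `DRY_RUN=1 download_runs SRR-IDENTIFIER-1 SRR-IDENTIFIER-2` (placeholder accessions) prints one fastq-dump command per run, which is a cheap way to check the list before starting a long download.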

EXERCISE

Going back to the previous GEO record:

  • Where can you find the SRA identifiers (code SRR…), for each sample?
  • How large are the raw data files?
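
Once a run has been downloaded, the same question can be answered locally. The sketch below reports the size on disk and the number of reads of gzip-compressed FASTQ files. The function names count_reads and fastq_summary are hypothetical, and the read count assumes that every FASTQ record spans exactly four lines.

```shell
# Sketch: size and read count of gzip-compressed FASTQ files.
# Assumption: each record is exactly 4 lines (header, sequence, "+", qualities).
count_reads() {
    local fq=$1 lines
    # Decompress to stdout and count lines without writing the plain file to disk.
    lines=$(gzip -dc "$fq" | wc -l)
    echo $(( lines / 4 ))
}

fastq_summary() {
    local fq bytes
    for fq in "$@"; do
        # Arithmetic expansion strips the padding that some wc versions add.
        bytes=$(( $(wc -c < "$fq") ))
        printf '%s\t%s bytes\t%s reads\n' "$fq" "$bytes" "$(count_reads "$fq")"
    done
}
```

With --split-files, a paired-end run is written as SRR-IDENTIFIER_1.fastq.gz and SRR-IDENTIFIER_2.fastq.gz, so `fastq_summary SRR-IDENTIFIER_1.fastq.gz SRR-IDENTIFIER_2.fastq.gz` should report the same number of reads for both files.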


Another source of high-quality gene expression data for human and mouse is the Encyclopedia of DNA Elements (ENCODE). The ENCODE portal gives access to the data produced by the ENCODE Consortium.