RNA-Seq data repositories

The major repositories for gene expression data:

These repositories are linked to the repositories of NGS raw data (Fastq files):

  • SRA (Sequence Read Archive)
  • ENA (European Nucleotide Archive)
  • DDBJ-DRA


EXERCISE

Let’s explore one of the GEO records:

  • Which platform and protocol were used for sequencing?
  • What type of RNA was sequenced?
  • How many samples were sequenced?


NOTE: You will need to download data from SRA for an independent project after this week! To download raw data from SRA, you can use the fastq-dump program from the SRA Toolkit, or download files from the NCBI FTP site using wget (for details, see https://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.when_to_use_a_command_line).

To download data, pass the SRA identifier to fastq-dump and choose options that match how the reads were sequenced (single or paired-end):

  • --split-files writes the two mates of paired-end reads to separate files; without it, both mates of each pair end up in a single file.
  • --origfmt keeps the original read names; by default, fastq-dump adds the SRA ID to each read in the file.
  • --gzip compresses the fastq files as they are written.
  • --skip-technical downloads only biological reads.

The command below downloads the fastq file(s) for one sample only (for example, using the SRR identifier SRR8571764 from the exercise above). It is slow and might take up to 30-40 minutes:

fastq-dump --gzip --origfmt --split-files --skip-technical SRR-IDENTIFIER
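As a minimal sketch of how the options above fit together, the small shell function below prints the fastq-dump command for one run, adding --split-files only for paired-end data. The function name build_fastq_dump_cmd and its layout argument ("single" or "paired") are hypothetical, introduced here for illustration only; they are not part of the SRA Toolkit. The function only prints the command, so nothing is downloaded until you run the printed line yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the SRA Toolkit): prints the fastq-dump
# command for one run, adding --split-files only for paired-end data.
build_fastq_dump_cmd() {
    local run_id="$1"
    local layout="${2:-paired}"   # assumed values: "single" or "paired"
    local opts="--gzip --origfmt --skip-technical"

    # Loose sanity check: SRR (NCBI), ERR (ENA) or DRR (DDBJ) prefix
    # followed by at least one digit.
    case "$run_id" in
        [SED]RR[0-9]*) ;;
        *) echo "not a run accession: $run_id" >&2; return 1 ;;
    esac

    if [ "$layout" = "paired" ]; then
        opts="$opts --split-files"
    fi
    echo "fastq-dump $opts $run_id"
}

# Print the command for the run from the exercise. "paired" is given here only
# to match the command above; check the actual layout in the run's SRA record.
build_fastq_dump_cmd SRR8571764 paired
```

Once the printed command looks right, it can be pasted into the terminal to start the download.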

To download all samples for a specific GEO experiment, use the SRA study identifier (e.g., SRP185848 for the GEO experiment considered above) and follow these steps:

  • First, download a list of SRR identifiers for all samples in the study: go to the NCBI SRA page for this study and, at the top right, click “Send” -> “File” -> “Accession List” -> “Save to file”. This gives you a text file with all SRR identifiers for the study; save it, for example, as “sra_ids.txt”.
  • Second, run the following loop (it downloads the fastq files for the samples one by one, not in parallel, and runs in the background, writing its output to the file “log”):
while read -r SRA; do fastq-dump --gzip --origfmt --skip-technical --split-files "$SRA"; done < sra_ids.txt > log &
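The loop above stops at nothing and records nothing when a single download fails, which is easy to miss in a long batch. The sketch below shows one possible way to make it more robust: it skips blank lines in the accession list and writes the identifiers of failed downloads to a separate file. The function name download_runs, the FASTQ_DUMP override variable, and the default file name failed_ids.txt are all hypothetical, introduced for illustration; only the fastq-dump options come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: download every run listed in a file, one at a time.
# FASTQ_DUMP is an assumed override hook so the loop can be dry-run
# (e.g. FASTQ_DUMP=echo); by default the real fastq-dump is called.
download_runs() {
    local id_file="$1"
    local failed_file="${2:-failed_ids.txt}"
    local dump_cmd="${FASTQ_DUMP:-fastq-dump}"
    local run_id

    : > "$failed_file"    # start with an empty list of failures

    # '|| [ -n "$run_id" ]' also processes a last line without a trailing newline.
    while IFS= read -r run_id || [ -n "$run_id" ]; do
        [ -z "$run_id" ] && continue    # skip blank lines
        echo "downloading $run_id"
        # </dev/null keeps the command from consuming the rest of the ID list.
        if ! "$dump_cmd" --gzip --origfmt --skip-technical --split-files "$run_id" < /dev/null; then
            echo "$run_id" >> "$failed_file"    # record the failure and carry on
        fi
    done < "$id_file"
}
```

With the accession list saved as sra_ids.txt, this could be started in the background with `download_runs sra_ids.txt > log 2>&1 &`. A dry run with `FASTQ_DUMP=echo download_runs sra_ids.txt` prints the options that would be passed for each identifier without downloading anything.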


Another source of high-quality gene expression data for human and mouse is the Encyclopedia of DNA Elements (ENCODE). The ENCODE portal provides access to the data produced by the ENCODE Consortium.