Data used in this course
For this course, we will use public data sets from GEO data set GSE76647.
Exercise:
- Go to the GEO page corresponding to entry GSE76647
- What is the organism the samples come from? Which types of cells? What is the goal of the experiment?
- What information do you get about the sequencing protocol?
- Retrieve the SRA codes corresponding to all samplesfrom the SRA RUN Selector)
Data details
- Data set GSE76647 has 10 Homo sapiens samples from differentiated (5-day differentiation) and undifferentiated primary keratinocytes.
Some of the samples underwent a knock-down of the FOXC1 gene:
GEO ID | SRA ID | Sample name | Differentiation | Condition |
GSM2031982 | SRR3091420 | 5p4_25c | undiff | WT |
GSM2031983 | SRR3091421 | 5p4_27c | undiff | WT |
GSM2031984 | SRR3091422 | 5p4_28c | diff 5 days | WT |
GSM2031985 | SRR3091423 | 5p4_29c | diff 5 days | WT |
GSM2031986 | SRR3091424 | 5p4_30c | diff 5 days | WT |
GSM2031987 | SRR3091425 | 5p4_31cfoxc1 | undiff | WT |
GSM2031988 | SRR3091426 | 5p4_32cfoxc1 | undiff | WT |
GSM2031989 | SRR3091427 | 5p4_33cfoxc1 | undiff | WT |
GSM2031990 | SRR3091428 | 5p4_34cfoxc1 | diff 5 days | WT |
GSM2031991 | SRR3091429 | 5p4_35cfoxc1 | diff 5 days | WT |
- The raw data can be downloaded as follows, using fastq-dump
# Go to the "raw_data" folder
cd ~/rnaseq_course/raw_data
# Download raw data files
for sra in SRR309142{0,1,2,3,4,5,6,7,8,9}
do echo $sra
$RUN fastq-dump --gzip --origfmt --skip-technical --split-files $sra
done
- BACK UP
In case downloading those files is too slow, we have prepared fastq files that correspond to chromosome 6 only. Please download it:
# go to raw data folder
cd ~/rnaseq_course/raw_data
# archive containing the 6 fastq files for chromosome 6 only
wget https://public-docs.crg.es/biocore/projects/training/PHINDaccess2020/fastq_chr6.tar.gz
# extract archive
tar -xvzf fastq_chr6.tar.gz
# remove .tar.gz file
rm fastq_chr6.tar.gz
- How many reads are there per sample?
- What is the read length?
Number of reads:
cd ~/rnaseq_course/raw_data/fastq_chr6
# With grep
zcat SRR3091420_1_chr6.fastq.gz | grep "^@" | wc -l
# With paste
zcat SRR3091420_1_chr6.fastq.gz | paste - - - - | wc -l
# With awk
zcat SRR3091420_1_chr6.fastq.gz | awk 'BEGIN{i=0}{i++;}END{print i/4}'
For all samples:
for fastq in *fastq.gz
do echo $fastq
zcat $fastq | awk 'BEGIN{i=0}{i++;}END{print i/4}'
done
Count read length:
# Count read length for all rows
zcat SRR3091420_1_chr6.fastq.gz | paste - - - - | awk '{print length($2)}'
# Summarize
zcat SRR3091420_1_chr6.fastq.gz | paste - - - - | awk '{print length($2)}' | sort | uniq -c