Variables and “For” loops

Variables

A variable can contain a number, a character or a string of characters.
It is meant to temporarily store a piece of information.

# Assign the string "hola" to the variable "myfirstvariable":
myfirstvariable=hola
# Show the content of the variable
echo $myfirstvariable
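
Two details of the assignment above are easy to trip over: there must be no space around "=", and a value that contains spaces needs quotes. A minimal sketch (the variable "mygreeting" and its content are made up for this example):

```bash
# no space around "=": this assigns the string "hola"
myfirstvariable=hola

# a value containing spaces must be quoted
mygreeting="hola mundo"

# quoting the variable when reading it keeps the value as one piece of text
echo "$myfirstvariable"
echo "$mygreeting"
```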

Store a directory path in a variable and list what is inside that directory:

# shortcut to a directory path
pathtodir=~/my_beautiful_folder

# list what is in that directory
ls $pathtodir

# store the output of "ls -l" in the variable "mylist"
mylist=$(ls -l)
# equivalent, older syntax using backticks
mylist=`ls -l`

# Show content of "mylist"
echo $mylist
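
When a variable holds several lines, as "mylist" does, an unquoted echo shows everything on one line, because the shell splits the content on whitespace. A small sketch of the difference, using printf to build a two-line value so that the result does not depend on your files (the variable "twolines" is made up for this example):

```bash
# build a value that contains two lines
twolines=$(printf 'first line\nsecond line')

# without quotes: the line break is lost
echo $twolines

# with double quotes: the line break is kept
echo "$twolines"
```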

# Store a command in a variable, between quotes
myls="ls -l"

# Will run "ls -l" in the current directory
$myls

# create one variable:
myname=Sarah

# use this variable when creating a new one, either with single or double quotes:
mytext1='My name is $myname'
mytext2="My name is $myname"

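# access the content of each variable:
# single quotes keep the text as is, so $myname is not replaced
echo $mytext1
My name is $myname

# double quotes let bash replace $myname with its content
echo $mytext2
My name is Sarah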

# create a variable
mynumber=1

# bash looks for a variable called "mynumber_one", which does not exist, so nothing is shown
echo $mynumber_one

# with curly braces, ${mynumber} is correctly interpreted
echo ${mynumber}_one

# create the variable "num" that contains the number 2
num=2

# add 1 to "num"
echo $((num + 1))

# same as:
echo $(echo $num+1 | bc)

# same as:
echo `echo $num+1 | bc`

# divide by 3 (bc returns a whole number unless "scale" is set, so this shows 0):
echo `echo $num/3 | bc`

# show 4 decimals:
echo "scale=4; $num/3" | bc

bc is a basic command-line calculator.
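
The $(( )) syntax only handles whole numbers, which is why bc is needed for decimals. A minimal sketch of the difference, reusing num=2 from above (the variables "whole" and "decimal" are made up for this example, and bc is assumed to be installed):

```bash
num=2

# integer arithmetic: the decimal part is dropped, so this shows 0
whole=$((num / 3))
echo $whole

# bc with 4 decimals
decimal=$(echo "scale=4; $num/3" | bc)
echo $decimal
```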


A few built-in variables exist, for example:

Variable     Returns
$USER        the user name
$HOME        the path of the home directory
$HOSTNAME    the hostname of the machine you are currently connected to
$RANDOM      a different random number each time it is accessed


Example:

echo "My user name is $USER, I work on the machine $HOSTNAME and my home directory path is $HOME."
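
$RANDOM combines well with arithmetic. As a small sketch, the remainder operator "%" can bring the random number into a chosen range, here 1 to 6 (the variable "roll" is made up for this example):

```bash
# RANDOM % 6 is a number between 0 and 5; adding 1 gives a number between 1 and 6
roll=$((RANDOM % 6 + 1))
echo "The roll is $roll"
```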

“For” loops

For loops are used to repeat certain tasks or blocks of code.

The basic construct is:

for i in 1 2 3 4 5
do
	echo $i
done

# 1 to 100
for i in {1..100}
do
        echo $i
done

# 1 to 100 with steps of 5
for i in {1..100..5}
do
        echo $i
done
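
Brace expansion such as {1..100} only works with numbers written out directly: in bash, {1..$n} is not expanded into a list, because braces are processed before variables are replaced. When the limit is stored in a variable, one common alternative is the seq command. A minimal sketch, where "n" and "total" are names chosen for this example:

```bash
n=5
total=0

# "seq 1 $n" prints the numbers from 1 to n, one per line
for i in $(seq 1 $n)
do
        echo $i
        # keep a running sum of the numbers seen so far
        total=$((total + i))
done

echo "The sum of 1 to $n is $total"
```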

# At each iteration, add 2 to each number:
for i in {1..100..5}
do
        echo $i + 2 is $((i + 2))
done

# At each iteration, multiply the number by itself:
for i in {1..100..5}
do
	echo $i multiplied by itself is $((i * i))
done

# For each file whose name ends in "txt": show its name and count its lines
for i in *txt
do
        echo $i
        wc -l $i
done
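
# Rename each file: replace the "txt" at the end of the name with "tab"
for i in *txt
do
        echo $i
        # "$" anchors the pattern at the end of the name, so a "txt" elsewhere in the name is left alone
        newname=`echo $i | sed 's/txt$/tab/'`
        mv $i $newname
done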
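
The same renaming can be done without sed, using the shell's own suffix removal: ${i%.txt} is the content of "i" with a trailing ".txt" removed. The sketch below works in a temporary directory with two made-up files ("a.txt" and "b.txt"), so that it does not touch your own data:

```bash
# work in a temporary directory with two example files
workdir=$(mktemp -d)
cd "$workdir"
touch a.txt b.txt

for i in *.txt
do
        # remove the ".txt" suffix and add ".tab" instead
        newname=${i%.txt}.tab
        mv "$i" "$newname"
done

# the directory now contains a.tab and b.tab
ls
```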
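
# For each ".tab" file in "directory": change the extension to ".txt".
# basename removes both the directory path and the ".tab" suffix,
# so the renamed file ends up in the current directory.
for i in directory/*tab
do
        echo $i
        newname=`basename $i .tab`.txt
        mv $i $newname
done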

# For each sub-directory of the current directory: count its content
for i in */
do
        echo $i
        # enter directory
        cd $i
        # count how many items the directory contains
        ls | wc -l
        # leave the directory (go one directory up)
        cd ..
done
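
If the "cd" into a directory fails (for example because of missing permissions), the "cd .." that follows still runs and the loop ends up in the wrong place. One way to avoid this, shown here as a sketch, is to put the commands between parentheses: they run in a sub-shell, and the directory change is forgotten when the sub-shell ends. The directories "dirA" and "dirB" are made up for this example:

```bash
# build a small example tree in a temporary directory
workdir=$(mktemp -d)
cd "$workdir"
mkdir dirA dirB
touch dirA/one.txt dirA/two.txt dirB/three.txt

for i in */
do
        echo $i
        # the parentheses start a sub-shell: cd only applies inside them,
        # and ls runs only if cd succeeded
        (cd "$i" && ls | wc -l)
done

# we are still in the starting directory
pwd
```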

# Loop over the ".txt" files whose name does not start with "m"
for i in `ls *.txt | grep -v "^m"`
do
        echo $i
done
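
Looping over the output of ls breaks when file names contain spaces. For this particular filter, a sketch of an alternative is to let the shell pattern do the work: [!m] matches any single character except "m". The file names below are made up for this example:

```bash
# work in a temporary directory with a few example files
workdir=$(mktemp -d)
cd "$workdir"
touch apple.txt melon.txt "my notes.txt" pear.txt

# [!m]*.txt : names that end in ".txt" and do not start with "m"
for i in [!m]*.txt
do
        echo "$i"
done
```

Only apple.txt and pear.txt are listed; melon.txt and "my notes.txt" are skipped because they start with "m".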

# The directory loop can also be written on a single line, with ";" between the commands
for i in */; do echo $i; cd $i; ls | wc -l; cd ..; done

Exercises

  1. Using the fastq files from Module 1:
    • create a for loop that extracts the sequences which contain ATGCGTAA and creates a new file containing the corresponding fastq entries (4 rows each!).
    • Note: check the -A and -B parameters of grep!

  2. Write a for loop and use your knowledge of variables to retrieve the fasta sequences of the following proteins from the UniProt website:
    • Q9Y6G1, Q9NS00, Q9GZY8, O75843, Q3L8U1, P49810, P01584, O00182, Q02224, Q13547
    • Note: for one protein only (e.g. ID: O94907 gene: DKK1), you would use:
      wget https://www.uniprot.org/uniprot/O94907.fasta

  3. For each of these proteins, write a for loop to:
    • create a folder for each protein (using the protein ID).
    • move the fasta files of each protein into the appropriate folder.
    • (optional): change the name of the fasta file to append the name of the corresponding gene (e.g. O94907.fasta will become O94907_DKK1.fasta). Note: the name of the gene is found in the header of each fasta file!
    • you will need: cd, mkdir, mv, basename, cut, grep.

      If you feel confident, do 2. and 3. together!

Correction

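# 1.
# -B 1 keeps the header line before the matching sequence,
# -A 2 keeps the "+" line and the quality line after it.
# Note: grep prints a "--" line between matches that are not adjacent.
for file in *fastq.gz
do
echo $file
zcat $file | grep -A 2 -B 1 "ATGCGTAA" > `basename $file .fastq.gz`_ATGCGTAA.txt
done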

# 2.
for protein in Q9Y6G1 Q9NS00 Q9GZY8 O75843 Q3L8U1 P49810 P01584 O00182 Q02224 Q13547
do
wget https://www.uniprot.org/uniprot/${protein}.fasta
done

# 3. 
for protein in Q9Y6G1 Q9NS00 Q9GZY8 O75843 Q3L8U1 P49810 P01584 O00182 Q02224 Q13547
do
mkdir $protein
mv ${protein}.fasta $protein
cd $protein
# the gene name is in the third "|"-separated field of the header, before the "_"
genename=$(grep ">" ${protein}.fasta | cut -d"|" -f3 | cut -d"_" -f1)
mv ${protein}.fasta ${protein}_${genename}.fasta
cd ..
done

# 2. + 3. together
for protein in Q9Y6G1 Q9NS00 Q9GZY8 O75843 Q3L8U1 P49810 P01584 O00182 Q02224 Q13547
do
mkdir $protein
cd $protein
wget https://www.uniprot.org/uniprot/${protein}.fasta
# the gene name is in the third "|"-separated field of the header, before the "_"
genename=$(grep ">" ${protein}.fasta | cut -d"|" -f3 | cut -d"_" -f1)
mv ${protein}.fasta ${protein}_${genename}.fasta
cd ..
done
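
To see what the two cut commands do, it helps to run them on a single header line. The header below is written by hand to mimic the UniProt style for O94907 and is only an assumption for illustration; real headers contain more fields:

```bash
# example header, written by hand in the UniProt style
header=">sp|O94907|DKK1_HUMAN Dickkopf-related protein 1"

# third field when splitting on "|": DKK1_HUMAN Dickkopf-related protein 1
# first field of that when splitting on "_": DKK1
genename=$(echo "$header" | cut -d"|" -f3 | cut -d"_" -f1)
echo $genename
```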
