Variables and “For” loops
Variables
A variable can contain a number, a character or a string of characters.
It is meant to temporarily store a piece of information.
- Why would you want to learn about variables?
- to store paths or long strings of characters.
- to run loops (for, while, etc.).
- inside bash/shell scripts (e.g. arguments passed to scripts are variables).
- Create a variable:
# Assign the string "hola" to the variable "myfirstvariable":
myfirstvariable=hola
- Variable names can be written in:
- uppercase
- lowercase
- a mixture of both
- Access the content of a variable with $
- Show the content of a variable with echo
# Show the content of the variable
echo $myfirstvariable
Store a directory path in a variable and list what is inside that directory:
# shortcut to a directory path
pathtodir=~/my_beautiful_folder
# list what is in that directory
ls $pathtodir
- Command substitution:
- Take the output of a command and save it as the value of a variable:
# run "ls -l" and store its output in "mylist"
mylist=$(ls -l)
# same thing with the older backtick syntax
mylist=`ls -l`
# Show content of "mylist" (the double quotes keep the line breaks)
echo "$mylist"
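When the stored output spans several lines, whether the variable is quoted changes what echo prints. The sketch below is a minimal illustration; it uses printf instead of ls so that the result does not depend on the current directory, and the variable names are made up for the example.

```shell
# Store a three-line output in a variable (printf stands in for a command like "ls -l")
mylines=$(printf 'first\nsecond\nthird\n')

# Unquoted: the shell splits the content into words, so echo prints everything on one line
unquoted=$(echo $mylines)

# Quoted: the line breaks stored in the variable are preserved
quoted=$(echo "$mylines")

echo "$unquoted"
echo "$quoted"

# Count the lines in each version
n_unquoted=$(echo "$unquoted" | wc -l)
n_quoted=$(echo "$quoted" | wc -l)
echo "unquoted: $((n_unquoted)) line(s), quoted: $((n_quoted)) line(s)"
```

Quoting the variable is usually the safer habit whenever its content may contain spaces or line breaks.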
- Shortcut to a command:
- Store a command (with its options) in a variable and run it through that variable:
# Between quotes
myls="ls -l"
# Will run "ls -l" where you are
$myls
- Using quotes:
- single quotes ' ' treat each character literally.
- double quotes " " can access the content of variables.
# create one variable:
myname=Sarah
# use this variable when creating a new one, either with single or double quotes:
mytext1='My name is $myname'
mytext2="My name is $myname"
# access the content of each variable:
echo $mytext1
My name is $myname
echo $mytext2
My name is Sarah
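Two assignment rules are easy to trip over: there must be no space around the = sign, and a value that contains spaces has to be quoted. The following is a small sketch of both rules combined with the quoting behaviour described above; the variable names are illustrative only.

```shell
# No spaces are allowed around "=" in an assignment
greeting=hola

# A value that contains spaces must be quoted,
# otherwise the shell treats the following words as a command to run
sentence="hola, my name is Sarah"

# Double quotes expand variables, single quotes do not
expanded="$greeting Sarah"
literal='$greeting Sarah'

echo "$sentence"
echo "$expanded"   # hola Sarah
echo "$literal"    # $greeting Sarah
```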
- Use curly brackets if there are ambiguities:
- { } isolate the variable name
# create a variable
mynumber=1
# bash is looking for a variable called "mynumber_one", which does not exist!
echo $mynumber_one
# ${mynumber} is correctly interpreted
echo ${mynumber}_one
- Calculations on variables:
# create the variable "num" that contains the number 2
num=2
# add 1 to "num"
echo $((num + 1))
# same as:
echo $(echo $num+1 | bc)
# same as:
echo `echo $num+1 | bc`
# divide by 3 (bc truncates to an integer by default, so this prints 0):
echo `echo $num/3 | bc`
# show 4 decimals:
echo "scale=4; $num/3" | bc
bc is a basic command-line calculator.
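The $(( )) syntax works on integers only, so bc is needed as soon as decimals are involved. A minimal sketch of what integer arithmetic returns (the values are chosen for illustration):

```shell
num=7

# Integer division drops the decimal part
quotient=$((num / 2))

# The remainder is available with %
remainder=$((num % 2))

# The result of a calculation can be stored back in a variable
num=$((num + 1))

echo "quotient: $quotient"     # quotient: 3
echo "remainder: $remainder"   # remainder: 1
echo "num is now: $num"        # num is now: 8
```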
A few built-in variables exist, for example:
| Variable | Returns |
|---|---|
| $USER | the user name |
| $HOME | the path of the home directory |
| $HOSTNAME | the hostname of the machine you are currently connected to |
| $RANDOM | a different random number each time it is accessed |
Example:
echo my user name is $USER, I work on the machine $HOSTNAME and my home directory path is $HOME.
“For” loops
For loops are used to repeat certain tasks or blocks of code.
The basic construct is:
for VARIABLE in RANGE
do
COMMANDS
done
- At each iteration (repetition) of the loop, VARIABLE is assigned a value from RANGE, sequentially.
- In the example below:
- at the first iteration, 1 is assigned to i
- at the second iteration, 2 is assigned to i
- and so on…
for i in 1 2 3 4 5
do
echo $i
done
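A variable defined before the loop keeps its value from one iteration to the next, which makes it possible to accumulate a result. As a small sketch (the name total is just an example), the loop below adds up the five values:

```shell
total=0
for i in 1 2 3 4 5
do
    # add the current value of i to the running total
    total=$((total + i))
done
echo "The sum is $total"   # The sum is 15
```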
- Loop on longer ranges of values:
# 1 to 100
for i in {1..100}
do
echo $i
done
# 1 to 100 with steps of 5
for i in {1..100..5}
do
echo $i
done
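In bash, brace ranges such as {1..100} are expanded before variables are substituted, so {1..$n} does not produce a range. One common workaround, sketched here under the assumption that the seq command is installed (it is part of GNU coreutils), is to generate the values with seq; start, step and stop are illustrative names.

```shell
start=1
step=5
stop=20

values=""
# seq FIRST INCREMENT LAST prints the numbers from FIRST to LAST in steps of INCREMENT
for i in $(seq "$start" "$step" "$stop")
do
    values="$values $i"
done
echo "Values visited:$values"   # Values visited: 1 6 11 16
```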
- Perform operations on variables in a loop:
# At each iteration, add 2 to the current number:
for i in {1..100..5}
do
echo $i + 2 is $((i + 2))
done
# At each iteration, multiply the number by itself:
for i in {1..100..5}
do
echo $i multiplied by itself is $((i * i))
done
- Use a for loop to count the number of lines of all text files in a folder:
for i in *txt
do
echo $i
wc -l $i
done
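File names that contain spaces are split into several words when the loop variable is not quoted. The sketch below shows a more defensive version of the same loop; it creates its own made-up files in a temporary directory so that it can be run anywhere.

```shell
# Work in a temporary directory so that no real file is touched
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/my notes.txt"
printf 'a\n' > "$workdir/short.txt"

total=0
for f in "$workdir"/*txt
do
    # Quoting "$f" keeps a name such as "my notes.txt" in one piece
    # Reading from standard input makes wc print the number only
    n=$(wc -l < "$f")
    echo "$f has $((n)) lines"
    total=$((total + n))
done
echo "Total: $total lines"   # Total: 4 lines

rm -r "$workdir"
```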
- Use a for loop to change the extension of files:
- *txt matches all files whose names end with txt.
for i in *txt
do
echo $i
# replace "txt" only at the end of the name ($ anchors the match to the end)
newname=`echo $i | sed 's/txt$/tab/'`
mv $i $newname
done
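An alternative to calling sed is the shell's own suffix removal, ${variable%pattern}, which strips the pattern from the end of the value only. This is a sketch rather than the course's method; it runs in a temporary directory with made-up file names.

```shell
workdir=$(mktemp -d)
touch "$workdir/txt_results.txt" "$workdir/sample.txt"

for i in "$workdir"/*.txt
do
    # ${i%.txt} is the value of i without the trailing ".txt"
    newname="${i%.txt}.tab"
    mv "$i" "$newname"
done

# List the renamed files
renamed=$(ls "$workdir")
echo "$renamed"

rm -r "$workdir"
```

Note that the "txt" at the start of txt_results.txt is left untouched, because only the end of the name is considered.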
- Use basename to keep only the file name but not the path (for example when looping over files that are not in the current directory):
- basename takes a file path as its first argument and, optionally, the suffix to remove (typically the file extension) as its second argument.
for i in directory/*tab
do
echo $i
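# strip the path and the .tab extension, then add .txt
newname=`basename $i .tab`.txt
# note: the renamed file ends up in the current directory, not in "directory"
mv $i $newname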
done
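To make the effect of basename concrete, and to show how a renamed file could stay in its original folder, here is a minimal sketch that only manipulates strings (no file is created or moved). dirname, the counterpart of basename, returns the directory part of a path; the path itself is a made-up example.

```shell
path=directory/sample.tab

# basename removes the leading directories and, if given, the suffix
name=$(basename "$path" .tab)

# dirname returns the directory part of the path
dir=$(dirname "$path")

# Build the new name in the same directory as the original file
newpath="$dir/$name.txt"

echo "$name"      # sample
echo "$dir"       # directory
echo "$newpath"   # directory/sample.txt
```

Using mv "$path" "$newpath" would then rename the file without moving it out of its directory.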
- Loop around folders only:
for i in */
do
echo $i
# enter directory
cd $i
# count how many items the directory contains
# ("ls *" would also list the content of sub-directories)
ls | wc -l
# leave the directory (go one directory up)
cd ..
done
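If cd fails inside the loop (for instance because of missing permissions), the following cd .. moves one level too high. A possible safeguard, sketched below with made-up directory names in a temporary location, is to change directory inside a command substitution: it runs in a subshell, so the directory change is forgotten as soon as it ends.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/dirA" "$workdir/dirB"
touch "$workdir/dirA/f1" "$workdir/dirA/f2" "$workdir/dirB/f3"

cd "$workdir" || exit 1
before=$(pwd)

counts=""
for d in */
do
    # $( ... ) runs in a subshell, so this cd does not affect the loop itself
    n=$(cd "$d" && ls | wc -l)
    echo "$d contains $((n)) item(s)"
    counts="$counts $((n))"
done

# The loop never left the starting directory
after=$(pwd)

cd / && rm -r "$workdir"
```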
- Loop on selected files: all text files that DO NOT start with “m”:
for i in `ls *.txt | grep -v "^m"`
do
echo $i
done
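Parsing the output of ls breaks on file names that contain spaces. For a simple rule like this one, the selection can also be written directly in the pattern: [!m] matches any single character except m. A minimal sketch with made-up files in a temporary directory:

```shell
workdir=$(mktemp -d)
touch "$workdir/alpha.txt" "$workdir/my file.txt" "$workdir/beta 2.txt" "$workdir/gamma.tab"

cd "$workdir" || exit 1
selected=""
for i in [!m]*.txt
do
    echo "$i"
    # keep track of the selected names, each one between square brackets
    selected="${selected}[$i]"
done
echo "$selected"   # [alpha.txt][beta 2.txt]

cd / && rm -r "$workdir"
```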
- Write a for loop as a one-liner with ; (semicolon):
for i in */; do echo $i; cd $i; ls | wc -l; cd ..; done
Exercises
1. Using the fastq files from Module 1:
- create a for loop that extracts the sequences that contain ATGCGTAA and writes the corresponding fastq entries (4 lines each!) to a new file.
- Note: check the -A and -B parameters of grep!
2. Write a for loop and use your knowledge of variables to retrieve the fasta sequences of the following proteins from the UniProt website:
- Q9Y6G1, Q9NS00, Q9GZY8, O75843, Q3L8U1, P49810, P01584, O00182, Q02224, Q13547
- Note: for one protein only (e.g. ID: O94907 gene: DKK1), you would use:
wget https://www.uniprot.org/uniprot/O94907.fasta
3. For each of these proteins, write a for loop to:
- create a folder for each protein (using the protein ID).
- move the fasta file of each protein into the appropriate folder.
- (optional): rename the fasta file to append the name of the corresponding gene (e.g. O94907.fasta will become O94907_DKK1.fasta). Note: the name of the gene is found in the header of each fasta file!
- you will need: cd, mkdir, mv, basename, cut, grep.
If you feel confident, do 2. and 3. together!
Correction
# 1.
for file in *fastq.gz
do
echo $file
# -B 1 keeps the header line before the sequence, -A 2 keeps the "+" and quality lines after it
zcat $file | grep -A 2 -B 1 "ATGCGTAA" > `basename $file .fastq.gz`_ATGCGTAA.txt
done
# 2.
for protein in Q9Y6G1 Q9NS00 Q9GZY8 O75843 Q3L8U1 P49810 P01584 O00182 Q02224 Q13547
do
wget https://www.uniprot.org/uniprot/${protein}.fasta
done
# 3.
for protein in Q9Y6G1 Q9NS00 Q9GZY8 O75843 Q3L8U1 P49810 P01584 O00182 Q02224 Q13547
do
mkdir $protein
mv ${protein}.fasta $protein
cd $protein
genename=$(grep ">" ${protein}.fasta | cut -d"|" -f3 | cut -d"_" -f1)
mv ${protein}.fasta `basename ${protein}.fasta .fasta`_${genename}.fasta
cd ..
done
# 2. + 3. together
for protein in Q9Y6G1 Q9NS00 Q9GZY8 O75843 Q3L8U1 P49810 P01584 O00182 Q02224 Q13547
do
mkdir $protein
cd $protein
wget https://www.uniprot.org/uniprot/${protein}.fasta
genename=$(grep ">" ${protein}.fasta | cut -d"|" -f3 | cut -d"_" -f1)
mv ${protein}.fasta `basename ${protein}.fasta .fasta`_${genename}.fasta
cd ..
done