Introduction
BioJazz is a tool for evolving and designing biochemical reaction networks using genetic algorithm (GA). Typically, a BioJazz user wishes to evolve or design a small network or motif which accomplishes a specific function, such as a switch or an oscillator module. The network comprises a set of proteins whose attributes are encoded in a network’s ‘‘genome’’. The ‘‘genome’’ is a binary string which contains all the information necessary to determine how many proteins are present in the network, their structure, which proteins interact and the biochemical parameters of their interaction.
BioJazz implements a genetic algorithm which through a random process of replication, mutation, and selection (with either population-based or random walk style), attempts to incrementally improve how well those “genomes” perform a user-specified function. By encoding the network in a fashion that mimics the way nature does, BioJazz can use a larger variety of mutational operators than do traditional GAs (which use point mutations and crossover), such as gene duplications, gene deletions, and domain shuffling. Thus, BioJazz has the ability to change and evolve networks with respect to both topology and biochemical parameters, by starting from a designed network ‘‘de novo’’, or a partially or completely functional seed network.
Much of BioJazz’s ability to design realistic networks comes from the accompanying Allosteric Network Compiler (ANC). ANC is a stand-alone, rule-based compiler which has the ability to turn a high-level description of allosteric proteins into the corresponding set of biochemical equations. The proteins can exhibit many of the behaviours observed in nature, such as co-localization, allosteric transitions, binding and catalytic reactions. BioJazz is likely the first tool to couple a rule-based compiler with an evolutionary algorithm.
The principal user input consists of a scoring function, which evaluates a particular genome against the desired functionality, and returns a score reflecting the network’s performance. This score is compared against the score of other networks to determine whether the network survives to the next generation and replicates. Generally speaking, this involves applying a stimulus to the network and evaluating it’s response. Simulation of the network is accomplished by integrating a set of ordinary differential equations (ODEs) in ‘‘Matlab’’. The required Matlab files are automatically generated from ANC’s output using a tool called Facile.
BioJazz is highly configurable. For example, the user can specify evolutionary parameters such as mutation rates. Also, the user may restrict BioJazz to changing a subset of the network’s attributes. This is useful to “freeze” the network topology, with the effect that only the network’s biochemical parameters and not its structure are allowed to evolve.
While the genetic algorithm itself is not very tasking, scoring each individual of a population of genomes may require a lot of processing power. Therefore, BioJazz has an integrated capability to use workstation clusters to speed the computation.
The main features of BioJazz are:
- evolves both network topology and connection weights
- designs a network ‘‘de novo’’, or starting from user-specified seed network
- uses workstation clusters to speed up the design
- produces a human-readable model of the winning network
- highly configurable and parameterized
####Citation:
Song Feng, Julien F. Ollivier, Peter S. Swain, and Orkun S. Soyer BioJazz: in silico evolution of cellular networks with unbounded complexity using rule-based modeling Nucleic Acids Res. 2015 : gkv595v1-gkv595.
###Getting Started
####Installation
BioJazz requires the ANC and Facile tools. You can tell BioJazz where to get them by setting the ANC_HOME and FACILE_HOME environment variables to point to the appropriate directories. It is recommended to add the following lines to your ‘’~/.bashrc’’ file:
export ANC_HOME=~/workspace/anc
export FACILE_HOME=~/workspace/facile
alias anc='$ANC_HOME/anc.pl'
alias facile='$FACILE_HOME/facile.pl'
export BIOJAZZ_HOME=~/workspace/biojazz
alias biojazz='$BIOJAZZ_HOME/biojazz.pl'
BioJazz requires Matlab to be installed on all nodes used for computation, and assumes matlab can be started with the command ‘‘matlab’’. Here is an example of configuration in “~/.bashrc” file (on Mac OS X):
export MATLAB_HOME=/Applications/MATLAB_R2011b.app/bin
alias matlab='$MATLAB_HOME/matlab'
export PATH=$MATLAB_HOME:$PATH
DYLD_LIBRARY_PATH=/Applications/MATLAB_R2011b.app/bin/maci64:/Applications/MATLAB_R2011b.app/sys/os/maci64:/Applications/MATLAB_R2011b.app/runtime/maci64:$DYLD_LIBRARY_PATH
export DYLD_LIBRARY_PATH
Note that if you decide to use a cluster of workstations, these installation instructions apply to all workstations used.
#####CPAN modules CPAN is an internet database of Perl modules. BioJazz/ANC/Facile use several of them and they must be installed prior to use. You will need system administrator priviledges to install these modules (or see for instructions on how to install them in your home directory). You or your sysadmin will typically need to run the following commands on each system used (use sudo as prefix if available, if you don’t have a admin privilege here is a solution let you install perl modules in your user directory):
cpan -i Class::Std
cpan -i Class::Std::Storable
cpan -i String::CRC32
cpan -i Expect
cpan -i Carp
cpan -i WeakRef
cpan -i IPC::Shareable
cpan -i Linux::Pid
cpan -i Text::CSV
Test your installation by running Facile, ANC and BioJazz without any arguments:
$FACILE_HOME/facile.pl
$ANC_HOME/anc.pl
$BIOJAZZ_HOME/biojazz.pl
An error will be reported if any of the required modules are still missing. Simply run CPAN again to install the missing module.
#####GraphViz
If you would like ANC to generate diagrams of the reaction network and species, you will the ‘‘dot’’ application and the following CPAN module:
cpan -i GraphViz
#####SSH Agent
####Workspace Creation Depending on your specific application, BioJazz will require some customized configuration and scoring functions. Also, during the course of a single design run, BioJazz will generate a large number of files. For this reason, the user must create a properly configured workspace which will contain the appropriate configuration files, scoring functions, and design files.
To facilitate this, BioJazz can create the workspace for and populate it with the required directories and with template files to get you started. To do this, run the following command:
biojazz --command='create_workspace("bjazz")'
This will create the directory bjazz
and various sub-directories including config
and custom
. Your configuration files go in the config
directory, while your custom scoring functions go in the custom
directory. At this point, the user should familiarize him/herself with some the template files that are provided, and try to run BioJazz.
The template file will try to design a network which contains a high concentration of dimers, and demonstrates how to use some of the functions available to the user.
cd bjazz
less config/template.cfg # template configuration file
less config/Template.pm # template application-specific scoring function
####Running BioJazz
After taking a look around, try to run BioJazz. The cluster_type
and cluster_size
arguments override the specification contained in the configuration file, and will launch both slave nodes of the cluster on your machine.
biojazz --config=config/template.cfg --tag=first_try --cluster_type="LOCAL" --cluster_size=2
This will evolve the network for only a couple generations. The tag
argument is very important. In BioJazz, each design attempt is associated with a specific, user-specified tag. BioJazz will create a directory in your workspace containing all the results and other files generated during the optimization. This allows the user to attempt several optimizations simultaneously without fear of accidental loss of files. The name of the design’s working directory is work_dir/tag
. The work_dir
parameter is specified in your configuration file (and has a value of template
in this example). Thus the results of the above run are contained in the directory template/first_try
.
[user@host bjazz]$ ls -la template/test_main/
total 168
drwx------ 5 user group 4096 2008-06-03 14:53 .
drwx------ 3 user group 4096 2008-06-03 14:51 ..
drwx------ 2 user group 4096 2008-06-03 14:53 matlab
drwx------ 2 user group 4096 2008-06-03 14:53 obj
-rw------- 1 user group 71904 2008-06-03 14:53 ScorNode.0.20080603_145159.log
-rw------- 1 user group 67215 2008-06-03 14:53 ScorNode.1.20080603_145159.log
drwx------ 2 user group 4096 2008-06-03 14:51 source_2008-06-03-14:51:58
The obj
directory contains all the genomes generated in a machine-readable form. The matlab
contains the models generated by ANC, and the matlab
scripts generated by Facile. The ScorNode*
files are a log of the activity of each node in the cluster as work to compile and score genomes. The source*
directory is a snapshot of the source code used for that run such as your configuration and custom scoring files.
Now try modifying the configuration file to use other available workstations and run BioJazz again…
####Using a Makefile The program make is a UNIX classic and is useful in the context of BioJazz: not only for running BioJazz’s suite of tests, but also as a sophisticated aliasing mechanism. A Makefile is provided when your workspace is created, which contains some useful functions and can serve as a template for customization. The table below documents a number of makefile targets which correspond to some of the most typical and common tasks you will perform. Explaining how GNU’s make program works is well beyond the scope of this document, but one of make’s most useful switches is the dry-run (-n) switch. This switch will show you the commands to be executed, without actually executing them. You can use this to take a look at the BioJazz command and arguments that are generated for each of the targets below.
Target(Arguments) | Example | Description |
---|---|---|
Testing | ||
test_custom (module) | make test_custom module=MyDesign | Executes the run_testcases routine that should be contained in your application-specific design class. Typically, the run_testcases routine should hand-craft a genome, then run your scoring function on this genome. This allows you to test your scoring function before trying to evolve some networks. |
Evolution/Scoring | ||
evolve (config, tag, cluster_type, cluster_size) | make evolve config=ultrasensitive.cfg tag=08jan01 cluster_type=SSH cluster_size=5 | Starts BioJazz and evolves a circuit as per the given configuration file. All output is echoed to your terminal saved to a logfile for subsequent analysis. |
score (config, tag, genome) | make score config=ultrasensitive.cfg tag=08jun25c genome=ultrasensitive/08jan01/obj/G199_I00.obj | Runs your custom scoring function on the specified genome. All the files are created in the scratch space for this design. This target is useful for debugging unanticipated problems that crop up for a specific genome during an evolution run. |
Statistics | ||
collect (config, tag) | make collect config=ultrasensitive.cfg tag=08jan01 | Collects population data from the logfile generated during an evolution run, and save them to a file in object form. |
analyze (config, tag) | make analyze config=ultrasensitive.cfg tag=08jan01 | Reads the population data collected above and performs some simple analysis tasks such as finding the top-scoring individual and computing statistics. The results are saved to an Excel spreadsheet for charting. |
Shell | ||
shell (config, tag) | Starts BioJazz in shell mode. | |
load (config, tag, genome) | Start BioJazz, load the specified genome, and go into shell mode. | |
Admin/Utils | ||
retag(config, old_tag, new_tag) | make retag config=ultrasensitive.cfg old_tag=temp new_tag=08jan01 | Moves and renames design files as appropriate so that they appear under the new_tag. |
clean (config, tag) | make clean config=ultrasensitive.cfg tag=08jan01 | Deletes all design files of the given tag. |
clean_all (config, tag) | make clean_all config=ultrasensitive.cfg tag=08jan01 | Deletes all design files of the given tag, including those in the scratch space (generated by the score target). |
####Scoring a specific genome
####Collecting and Analyzing Statistics
####BioJazz Shell
####Writing and Testing Application Specific Functions
#####Template Files
#####Testing
#####Debugging when things go wrong
####Workspace Directory Structure
bjazz # workspace home
config # configuration files
custom # application-specific modules and functions (incl. scoring function)
test/custom # recommended location for test results of custom modules
test/modules # BioJazz module test results
mydesign # application-specific directory
mydesign.08jun01.log # master node logfile
08jun01 # results directory for run with TAG=08jun01
ScorNode.i.timestamp.log # log file for slave node i
matlab # ANC genome models, eqn files, and matlab files
obj # genome objects in binary form
###Evolutionary Algorithm
BioJazz uses a genetic algorithm (GA) to search for a functional network. The idea behind a genetic algorithm is to encode each candidate network as a ‘‘genome’’, which is just a sequence of bits having the value 0 or 1. Given a population of individual network genomes, the GA calls a user-specified scoring function on each of the individuals in the population to determine it’s ‘‘fitness score’’. After computing the fitness scores, the GA creates a new generation of individual networks by cloning and mutating the previous generation in such a way that fitter individuals tend to have more descendants than less fit individuals. Thus, depending on its relative fitness score, the fate of a particular individual network may be to:
-
pass unchanged to the next generation (elite duplication)
-
generate one or more mutated clones (asexual duplication)
-
generate one or more children
-
have no descendants at all (elimination)
Networks with a higher score has a better chance of fates i), ii) or iii) (i.e. surviving) than networks with a relatively lower score.
The number of creation created is such that the population size is kept constant from generation to generation. The population size is specified via the parameter ‘‘max_population’’.
The following sections detail each step in the genetic algorithm.
####Initial Generation
The initial generation can be either generated randomly or loaded from disk, as specified by the initial_generation
parameter of the configuration file. In the random case, the user can also specify the number of individuals to create (parameter inum_genomes
) and the genome length (parameter – currently fixed at 5000). Loading from disk is useful to resume work on a partially completed design starting from the last generation created, or to load hand-crafted seed designs. The following shows some examples for each case:
initial_genome = random # random generation
initial_genome = load test/modules/Ultrasensitive.obj # load a hand-crafted network
initial_genome = load ultrasensitive/test/obj/G427_I*.obj # load all individuals of generation 427 of previous run
Regardless of how the initial generation is created, each individual is then stored under the following name in the working directory of the design:
$DESIGN_WORK/obj/G`ggg`_I`ii`.obj
Where ggg
is the generation number and ii
is the individual number.
####Scoring
- the score_genome method of a user-defined scoring class is called to score individual networks
- the scoring class inherits many useful method from ‘‘Scoring’’
- describe Scoring class, its use of MatlabDriver, etc.
####Replication
There are two modes of replication in BioJazz. They are rank-based and fitness-based replication.
#####Rank-based replication
In the rank-based mode, the fitness score returned by the user-defined evaluation routine are not used as is to compute the actual fitness of each network (the actual fitness being the expectation value for the number of children). This mode is appropriate when small numerical differences in the score do not meaningfully correspond to the number of children an individual should have. Therefore, BioJazz uses the fitness score of each individual to produce a ranking of the entire generation.
Children are generated according to the following procedure. First, the top ‘‘elite_pool_size’’ individuals are copied as is to the next generation, without any mutations. Next, an additional (‘‘max_population’’ - ‘‘elite_pool_size’’) children need to be created to keep the population size constant.
#####Fitness-based replication
In this mode, the computed fitness dictates the relative proportion of the available pool of children that a genome is to have.
Children are generated according to the following procedure. First, the top ‘‘elite_pool_size’’ individuals are copied as is to the next generation, without any mutations. Next, an additional (‘‘max_population’’ - ‘‘elite_pool_size’’) children need to be created to keep the population size constant. Each individual in the current generation is a potential parent and its expected number of children is proportional to the fitness computed from the ranking as described above.
To decide which individuals become parents a sampling procedure called the “roulette wheel” is used. Imagine a roulette wheel where each parent has a slot whose size is proportional to the fitness….
####Mutation Operators #####Point Mutation #####Gene Duplication #####Gene Deletion #####Domain Duplication #####Domain Deletion #####Domain Shuffling
###Network Encoding Model
####Complementarity Encoding Scheme
The complementarity encoding scheme is currently implemented in BioJazz.
Field Name | Length (L) | Allowed Values RegExp | Scaling | Dynamic Range | Resolution | Description |
---|---|---|---|---|---|---|
Genome | ||||||
PRE_JUNK | Any | [01]* | Untranslated sequence preceding first gene | |||
genes | 1 or more genes | Any legal gene | One or more genes separated by untranslated sub-sequences | |||
POST_JUNK | Any | [01]* | Untranslated sequence following last gene | |||
Genw | ||||||
START_CODE | 8 | 01111110 | Untranslated sequence preceding first gene | |||
concentration | 10 | [01]{L} | Log-linear | 1e3/1e-3 | 1.36% | Encodes the initial concentration of the protein |
UNUSED | 4 | [01]{L} | Reserved field | |||
domains | 1 or more domains | Any legal domain | One or more domains separated by a soft linker pattern (“001”) | |||
STOP_CODE | 3 | 111 | Terminates the gene | |||
Domain | ||||||
allosteric flag | 1 | [01]{1} | Maps directly to ANC domain attribute of the same name | |||
R <–> T rates | 10 | [01]{L} | Loglinear | 1e+2/1e-2 | 0.90% | Embedded in ANC model |
Φ | 10 | [01]{10} | Linear | 0.0 - 1.0 | ~1e-3 | Scaled to a number between 0 and 1 (inclusive), embedded in ANC model |
UNUSED | 4 | [01]{L} | Reserved field | |||
protodomains | 1 or more protodomains | Any legal protodomain | One or more protodomains separated by a hard linker pattern (“000”) | |||
ProtoDomain | ||||||
type | 2 | [01]{2} | Maps to protodomain type in ANC model, determines enzyme polarity. 00=bsite, 01=msite, 10=csite, 11=csite | |||
substrate polarity | 1 | [01]{1} | If the protodomain is a csite, determines whether it modifies (0, kinase) or unmodifies (1, phosphatase) the substrate | |||
binding profile | 10 | [01]{L} | Determines ligands pairs. If the binding profiles of two protodomains are sufficiently complementary, then a binding reaction occurs using kinetic rates calculated from the following attributes | |||
kf profile | 20 | [01]{L} | Log-linear profile | 1e3/1e-3 | 1.99x | Determines the association kinetics of two protodomains |
kb profile | 20 | [01]{L} | Loglinear-profile | 1e3/1e-3 | 1.99x | Determines the dissociation kinetics of two protodomains |
kp profile | 10 | [01]{L} | Loglinear | 1e3/1e-3 | 1.36% | For csite protodomains only, determines rate of product reaction when modifying |
Keq ratio (Γ) | 10 | [01]{10} | Loglinear | 1e2/1e-2 | 0.90% | Determines allosteric effect of msite modification. Embedded in ANC model |
kf polarity mask | 20 | [01]{L} | XORed with “kf profile” to determine profile of modified (msite=1) version of protodomain | |||
kf conformation mask | 20 | [01]{L} | XORed with “kf profile” to determine new profile of ‘T’ conformation | |||
kb polarity mask | 20 | [01]{L} | XORed with “kb profile” to determine profile of modified (msite=1) version of protodomain | |||
kb conformation mask | 20 | [01]{8} | XORed with “kb profile” to determine new profile of ‘T’ conformation | |||
kp conformation mask | 10 | [01]{8} | XORed with “kp profile” to determine new profile of ‘T’ conformation | |||
UNUSED | 4 | [01]{L} | Reserved field |
####Field Scaling
All the above fields correspond to parameter values whose scale and dynamic range is given in a configuration file. This section deals with the issue of translating an integer (binary encoded) scale into parameter values. In all sections, x is the integer value corresponding to a binary field, which goes from 0 to x_max=2^L-1 where L is the length of the field. Also, y_min is the minimum value of the parameter, and y_max is the maximum value. For example, the regulated_concentration parameter can be made to range from 1e-3 (y_min) to 1e+3 (y_max).
#####Linearly Scaled Fields
Relevant field(s): Φ.
Here the value of the parameter is y(x) = y_min + (y_max - y_min)*(x/x_max);
The resolution R(n) for an n-bit field is given by:
R(n) = (y_max-y_min)/(2^n-1)
#####Loglinear Scaled Fields
Relevant field(s): kp_profile, RT/TR_transition_rate, Keq_ratio, regulated_concentration
y(x) = y_min * (y_max/y_min)^(x/x_max), or
log(y) = log(y0) + (x/x_max)*log(y_max/y_min)
The fold-change between two values of x which differ by 1.0 is:
FC = y(x+1)/y(x) = (y_max/y_min)^(1/x_max).
For example, an L=8-bit field associated with a dynamic range of y_max/y_min = 1e6 implies xmin=0, x_max=2^L-1=255 and a fold-change per increment of x of:
FC(8bits) = 1e6^(1/255) = 1.0557
In other words, you can change by 5.57% increments. For 10 bits,
FC(10bits) = 1e6^(1/1023) = 1.0136.
#####Loglinear Scaled Profiles
Fields: kf_profile, kb_profile
Profiles do not directly encode parameters, but instead are compared to each other after applying a special complement() function to either one of the arguments. The result of this comparison is used to obtain a Hamming distance of 0 to at most L. This Hamming distance is then scaled log-linearly. Thus if two profiles are 8 bits long, x_min=0 and x_max=L=8, so for a dynamic range of 1e6 we have
FC = 1e6^(1/8) = 5.6234
I.e. a 560% change per increment. More appropriate would be to use 10-30 bits:
FC(10) = 1e6^(1/10) = 3.9811
FC(20) = 1e6^(1/20) = 1.9953
FC(30) = 1e6^(1/30) = 1.5849
I.e. a 2x or 60% change per increment.
######Improvement
Note that there is some inefficiency in this method of coding because we need extremely long fields to get good resolution, because x_max=L and not 2^L.
Currently for profiles X and Y, we have:
MISMATCH_VECTOR = XOR(X, complement(Y)) MISMATCH = ones(MISMATCH_VECTOR)
Equivalently,
MISMATCH = HAMMING(X, complement(Y))
An alternative would scale the MISMATCH_VECTOR, which has a range of 2^L, instead of MISMATCH, which has a range of just L. Caveat is that we must get the same thing if the profiles are commuted, but in fact the MISMATCH_VECTOR gets flipped, so the msb becomes the lsb. One way around this is to designate the 2 outer bits as msbs, and the 2 most inner bits as lsbs. You break the MISMATCH_VECTOR in two, flip the 2nd half and add to the first. This gets us a range of 0 to [2^(L/2)-1]*2.
###Scoring Functions
Authors and Contributors
The project was initialized by Julien Ollivier, then continuely developed and is currently maintained by Song Feng ( @LifeWorks ).
Support or Contact
If have any problems, please contact @LifeWorks or email lifeworks.song@gmail.com