Purpose
MiRPlex is a tool for microRNA prediction from high-throughput sequencing data that requires only sRNA datasets as input. Mature miRNAs are predicted from such datasets through a multi-stage process involving filtering, miRNA:miRNA* duplex generation and duplex classification using a support vector machine (SVM). Tests on sRNA datasets from model animals demonstrate that the strategy is effective at predicting genuine miRNA duplexes and, for some sets, achieves a high degree of precision when considering only the mature sequence.
Pipeline
The tool works in several stages:
Stage 1: Building duplexes from an sRNA dataset. This stage is optional. The user should supply a FASTA format file with adaptors trimmed as input. If only a raw sequence file is available, the UEA sRNA Workbench can be used to trim the adaptors and convert to FASTA format. MiRPlex then processes this file and generates a set of duplexes for consideration by the SVM in stage 2. The stage is divided into a number of sub-steps, all of which are optional:
Assemble sRNAs into contigs and discard those sRNAs that participate in contigs. NOTE: This step is only available on Linux machines.
Duplicate the remaining sRNAs and apply the mature and star filters to the respective copies. Each filter can be defined by allowable sRNA length, complexity, ambiguity and abundance.
Pair sRNAs from the remaining mature and star datasets into duplexes. Duplexes can be further filtered based on their Levenshtein distance. The features of the remaining duplexes are analysed and recorded.
Stage 2: Duplexes are tested against defined SVM models. Two models are supplied: one for plants and one for animals. Each duplex is scored with a likelihood of being a genuine miRNA. Results can be filtered based on this score.
Stage 3: To gauge how well the SVM has performed, the results can be benchmarked against miRBase and a genome. Several sub-steps are performed:
The duplex's mature sequence is matched against mature sequences in miRBase.
Both sRNAs in the duplex are matched against hairpins in miRBase.
Both sRNAs in the duplex are aligned against a supplied genome and tested based on their genomic locations. Duplexes that contain sRNAs that are close to one another are considered valid.
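The Levenshtein distance filter mentioned in Stage 1 can be illustrated with a minimal Python sketch. This is not miRPlex's own implementation: the function names are hypothetical, and the assumption that a duplex is rejected when the distance between its two sRNAs exceeds the threshold (with -1 disabling the test) is only one reading of the parameter description.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # At the start of row i, prev[j] is the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_allowed_distance(mature: str, star: str, allowed: int) -> bool:
    """Hypothetical helper: keep a duplex unless its distance exceeds `allowed`.

    Assumption: a negative threshold (for example -1) disables the test.
    """
    if allowed < 0:
        return True
    return levenshtein(mature, star) <= allowed
```

The two-row formulation keeps memory proportional to the length of one sequence, which is ample for sRNA-sized inputs.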
Installation
Simply unzip the downloadable file into the directory of your choice. mirplex_0.1.zip (ZIP, 63Mb)
Tested Platforms
This program has been tested on Windows 7 64-bit and Linux CentOS 4. However, the program should work on all common platforms that have JRE 1.6 or above.
Usage
miRPlex is a command-line program written in Java. Although it contains dependencies written in native code, it should work on all major platforms. To run the program, navigate to the root directory of the program and type the following:
java -jar Workbench.jar --tool mirplex --params <path to param file here>
A description of the contents of a parameters file is provided in the next section.
miRPlex can use significant amounts of memory. Therefore it is recommended that the user provides an indication to the JVM of the amount of usable memory available on the system. This is achieved by using the -Xmx switch in the command line. See java documentation for more details on the Xmx argument.
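As a rough illustration of where the -Xmx switch goes, the following Python sketch assembles the command line shown above with a heap limit added. The helper name, the 4g value and the example parameter file path are all assumptions chosen for illustration; pick a heap size that suits the memory available on your system.

```python
from typing import List


def build_mirplex_command(params_path: str, max_heap: str = "4g") -> List[str]:
    """Hypothetical helper: build the miRPlex command line with a JVM heap limit."""
    # -Xmx is a JVM option, so it must appear before -jar; anything after the
    # jar name is passed to the program rather than to the JVM.
    return ["java", f"-Xmx{max_heap}", "-jar", "Workbench.jar",
            "--tool", "mirplex", "--params", params_path]


# Print the command for a hypothetical parameter file path.
print(" ".join(build_mirplex_command("/path/to/mirplex_params.cfg")))
```

The resulting list could be passed to `subprocess.run` with `cwd` set to the root directory of the program, or the printed line can simply be typed into a terminal.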
Parameters
A parameters file is required for running miRPlex. It stores the arguments detailing how miRPlex should operate. The parameters file is a plain text file with a simple key-value pair syntax: "param_name=param_value". The file can be broken up using whitespace as required. Comment lines can also be added by starting the line with a '#' symbol. The parameters file can contain the following keys:
- srna_file - The full path to the sRNA file containing reads (with adaptors trimmed) to process. Required for stage 1 processing.
- build_duplexes - Specifies whether to build duplexes from the srna_file. Set to 'true' to conduct stage 1 processing. Default: true
- contig_analysis - Specifies whether to discard sRNAs from the srna_file based on whether they form contigs. An assembly program "velvet" is used to assemble the contigs from sRNAs if requested. Default: "false".
- contig_safe_abundance - If contig analysis is performed, sRNAs participating in contigs are removed. However, if an sRNA has a sufficiently high abundance (the threshold value is set by this parameter), it is considered "safe" and is not removed from further analysis. To disable this feature, set the value to -1. Default: 25
- contig_kmer - Contig Kmer length. See velvet documentation for further details. Default: 15
- contig_min_length - The minimum contig length. See velvet documentation for further details. Default: 30
- mature_filter_params - The full path to another parameters file which specifies details of how to filter the sRNA dataset for mature miRNA candidates.
- star_filter_params - The full path to another parameters file which specifies details of how to filter the sRNA dataset for star miRNA candidates.
- allowed_duplex_distance - When testing for a valid duplex, we can optionally test the Levenshtein distance between the two sRNAs forming the duplex candidate. If the sRNAs are sufficiently different (the threshold value is set by this parameter), the duplex is not considered for further analysis. Set to -1 to deactivate this feature. Default: -1
- duplex_file - The full path to the duplex zip file which is produced after stage 1 processing. This parameter is optional. If it is specified and "build_duplexes" is "false", stage 1 processing is skipped and the run starts from stage 2 instead.
- svm_model_file - The full path to the SVM model file.
- svm_scaling_file - The full path to the SVM scaling file associated with the model file.
- p_val_filter - Each result from the SVM carries a probability value that the duplex is a miRNA. To filter out results that score below a given threshold, set that threshold here. The minimum value this can be set to is 0.5, below which the SVM considers the duplex more likely not to be a miRNA. Maximum is 1.0. Default: 0.5.
- out_dir - The full path to the directory that all output files will be written to.
- out_prefix - The prefix that will be applied to all output file names.
- loci_validator - It is possible to gauge the performance of miRPlex by comparing the results to known mature miRNAs and hairpins. If a genome is available, the sRNAs in the duplex can also be aligned to the genome and assessed based on their genomic distance. To perform this validation step, set this parameter to "true". Default: false
- loci_known_mirnas - The full path to a fasta file containing known miRNA mature sequences in DNA form.
- loci_known_hairpins - The full path to a fasta file containing known miRNA hairpin sequences in DNA form.
- loci_genome - The full path to a genome file to align duplexes to.
- loci_max_distance - The maximum allowed genomic distance for sRNAs in a duplex. Above this distance, the duplex is considered invalid. Default: 400
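To make the file syntax concrete, here is a minimal Python sketch that reads text in the "param_name=param_value" form described above, skipping blank lines and '#' comments. It is an illustration only, not the parser miRPlex uses: the function name is hypothetical, all file paths in the example are made up, and trimming whitespace around keys and values is an assumption.

```python
def parse_params(text: str) -> dict:
    """Hypothetical reader for key=value parameter text with '#' comment lines."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected param_name=param_value, got: {line!r}")
        # Assumption: surrounding whitespace is not significant.
        params[key.strip()] = value.strip()
    return params


# Example parameter text using keys from the list above; every path is hypothetical.
EXAMPLE_PARAMS = """\
# Stage 1: build duplexes from trimmed reads
srna_file=/data/reads_trimmed.fa
build_duplexes=true
mature_filter_params=/data/mature_filter.cfg
star_filter_params=/data/star_filter.cfg

# Stage 2: classify duplexes
svm_model_file=/data/model_file
svm_scaling_file=/data/scaling_file
p_val_filter=0.5

# Output
out_dir=/data/results
out_prefix=run1
"""

params = parse_params(EXAMPLE_PARAMS)
print(params["srna_file"], params["p_val_filter"])
```

All values are kept as strings here; converting them to booleans or numbers is left to the caller.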
In stage 1, a key step is filtering sRNAs into two groups: one for candidate mature sequences and the other for candidate star sequences. The filtering is performed by the UEA sRNA filter tool and is described by two filter parameters files, the paths to which are defined by the "mature_filter_params" and "star_filter_params" parameters respectively. The filter parameters files have the same syntax as the miRPlex parameters file, although the supported parameters differ. The parameters are as follows:
- make_nr - If set to "true" then the output is in non-redundant format, i.e. duplicate sRNAs are stored only once, and the number of duplications (the abundance) is stored as an integer. Default: true. Please leave this property set to true for use in miRPlex, otherwise the program may fail and will run much slower and use more memory.
- filter_low_comp - If set to "true" then low-complexity sequences (sequences containing fewer than 3 distinct nucleotides) are removed. Default: true. Please leave this property set to true for use in miRPlex, otherwise the program may fail and will run much slower and use more memory.
- min_length - Minimum length of the sRNA sequence, shorter sequences are discarded. Default: 16; Min: 16; Max: 49.
- max_length - Maximum length of the sRNA sequence, longer sequences are discarded. Default: 35; Min: 17; Max: 50.
- norm_abundance - Will first normalise the sRNA dataset, so that each sRNA's abundance is represented as reads per million (RPM). Optional. Default: false.
- min_abundance - Minimum abundance of the sRNA sequence; sequences with lower abundance are discarded. Optional.
- max_abundance - Maximum abundance of the sRNA sequence; sequences with higher abundance are discarded. Optional.
- watchlist - The full path to a file containing reads that are to be discarded if detected in the sRNA dataset. Reads in the file are stored in FASTA format. Optional.
- trrna - Whether to filter sRNAs that match to known transfer or ribosomal RNA. Optional. Default: false.
- trrna_sense - Whether to search for sRNAs that match to known t/rRNAs only on the sense strand. Only relevant if "trrna" is set to true. Default: false.
- filter_invalid - Whether to filter sRNAs that contain invalid characters. Default: false. NOTE: Please set this property to true for use in miRPlex, otherwise the program may fail.
- genome - The full path to a fasta format genome file. sRNAs that do not map to this genome are discarded. Optional.
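The length, complexity and abundance filters listed above can be sketched in a few lines of Python. This is an illustrative approximation, not the UEA sRNA filter tool itself: the function names are hypothetical, the input is assumed to be a non-redundant mapping from sequence to read count, and details such as case handling or ambiguous bases are not modelled.

```python
from typing import Dict, Optional


def is_low_complexity(seq: str) -> bool:
    """True if the sequence contains fewer than 3 distinct nucleotides."""
    return len(set(seq)) < 3


def to_rpm(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalise raw read counts to reads per million (RPM)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: n * 1_000_000 / total for seq, n in counts.items()}


def filter_srnas(counts: Dict[str, float],
                 min_length: int = 16,
                 max_length: int = 35,
                 min_abundance: Optional[float] = None,
                 max_abundance: Optional[float] = None) -> Dict[str, float]:
    """Hypothetical filter mirroring min_length, max_length, filter_low_comp
    and the optional abundance bounds described above."""
    kept = {}
    for seq, abundance in counts.items():
        if not (min_length <= len(seq) <= max_length):
            continue  # outside the allowed length range
        if is_low_complexity(seq):
            continue  # fewer than 3 distinct nucleotides
        if min_abundance is not None and abundance < min_abundance:
            continue
        if max_abundance is not None and abundance > max_abundance:
            continue
        kept[seq] = abundance
    return kept
```

To mimic norm_abundance, pass the result of `to_rpm` to `filter_srnas` so that the abundance bounds are interpreted in RPM rather than raw counts.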
Debugging
Should error messages occur, they should contain sufficient information about what caused the problem and how to rectify it. However, if it is not possible to determine the root cause from the error message alone, a logging system is provided to assist with the diagnosis. The program logs are stored in the ./User/logs directory. Several logs are generated, as described below:
- exe_manager_server: miRPlex utilises a second process to make calls out to external binaries. This is to avoid memory issues on Unix platforms, where the memory space of the host process is duplicated for the spawned process. This becomes a problem for programs with large memory requirements. The exe_manager circumvents this problem by running with a small memory footprint and communicating with the host process over the network. When processing large chunks of data that can be multi-threaded, a new exe_manager process is also started for each thread, which results in a separate exe_manager log for each exe_manager instance.
- workbench: miRPlex utilises much of the infrastructure used for the UEA sRNA Workbench. All log information from miRPlex, or supporting Workbench infrastructure, is stored in these logs.
The log files are broken down further into different levels:
- error: Severe or critical errors that occurred during the run.
- info: Any messages stored in "error", plus general runtime information.
- debug: Only produced in verbose mode. Contains all messages in "info", plus detailed runtime information describing the system's progress.
License
sRNA-WorkbenchEULAR (pdf 31 KB).
Disclaimer
The UEA sRNA Workbench software is supplied "AS IS" and all use is at the user's own risk. The licensor disclaims all warranties of any kind, either express or implied, as to the UEA sRNA Workbench software, including but not limited to implied warranties of fitness for a particular purpose, merchantability or non-infringement of proprietary rights. Neither this agreement nor any documentation furnished under it is intended to express or imply any warranty that the operation of the UEA sRNA Workbench software will be uninterrupted, timely or error-free.
References
Please cite the following paper if you use miRPlex:
Mapleson, D.; Moxon, S.; Dalmay, T.; Moulton, V. (2012) MirPlex: A tool for identifying miRNAs in high-throughput sRNA datasets without a genome. Journal of Experimental Zoology: Part B.
Research team
Dr. Daniel Mapleson, Dr. Simon Moxon, Dr. Tamas Dalmay, Prof. Vincent Moulton