|-----------------------------------------|
| Specification of the GTrack file format |
|-----------------------------------------|

GTrack version: 1.0
Document version: 1.0
Date: 23 Dec 2011
Authors: Sveinung Gundersen, Matus Kalas, Osman Abul, Arnoldo Frigessi, Eivind Hovig, Geir Kjetil Sandve

----------------
    Contents
----------------

* Reading the specification
* What is GTrack?
* Example GTrack files
* Basic specification
    i.  Comments
    1.  Header lines
    2.  Column specification line
    3a. Bounding region specification line
    3b. Data lines
    - BED compatibility
    - Compression
    - Detailed specification of character usage
* Extended specification
    - Redefining column names
    - WIG compatibility
    - FASTA compatibility
    - Defining GTrack subtypes
* References
* Change log

---------------------------------
    Reading the specification
---------------------------------

This document contains the complete specification of the GTrack format. As the document contains many details, we here present some reading recommendations:

- Skip the "Developer notes" sections if you are not planning to develop parsers of the GTrack format.
- The "Restrictions" section after each main type of GTrack lines contains detailed descriptions that can be skipped by most readers.
- The section "Detailed specification of character usage" contains very detailed information and can be skipped by most readers.
- The sections under "Extended specification" describe extensions that do not add any new types of information to a GTrack file, only alternative ways of expressing the same information, in addition to functionality for defining GTrack subtypes. These sections are thus not required for basic use.

An HTML version of this specification is available at [3]. In the HTML version, the sections described above are hidden by default.

-----------------------
    What is GTrack?
-----------------------

GTrack is short for both "Genomic Track" and "Generic Track".
GTrack is a general-purpose, tabular file format for representing data in the form of genomic tracks, that is, as elements associated with positions along a reference (genome) sequence, or a set of sequences. GTrack emphasizes precision, flexibility, and simple parsing. This is achieved by allowing flexible column specification and by declaring syntactic properties at the beginning of the file (allowing parsers to cleanly restrict support to a subset of the GTrack specification).

A main contribution of the format is the unified and optimized formalization of sequence-level genomic data into one of fifteen track types, as developed in [1]:

    Points (P)
    Valued Points (VP)
    Segments (S)
    Valued Segments (VS)
    Genome Partition (GP)
    Step Function (SF)
    Function (F)
    Linked Points (LP)
    Linked Valued Points (LVP)
    Linked Segments (LS)
    Linked Valued Segments (LVS)
    Linked Genome Partition (LGP)
    Linked Step Function (LSF)
    Linked Function (LF)
    Linked Base Pairs (LBP)

These fifteen track types encompass most of the existing file formats, while providing support for, among other things, genomic data of a three-dimensional nature. The primary goals of the GTrack format are to support all track types systematically, simplify parsing and manipulation, allow custom extensions, and provide efficient storage.

----------------------------
    Example GTrack files
----------------------------

Before delving into the details, it is recommended that you examine these examples of simple GTrack files. You may return to them while reading the rest of the specification, if needed.

The first example is the simplest version of GTrack, without any specification lines. It shows a data set of a couple of genomic segments, and the track type is simply Segments (S).

    #
    # GTrack example file 1
    #
    # A GTrack file without headers is handled as three-column BED [2]
    #
    chr1    121    201
    chr2    486    1240

The second example contains all GTrack specification lines (header line, column specification line and bounding region specification line) and shows a dataset of genomic segments with additional associated information in extra columns. One of these is selected as the main "value" of the segments, which are then of type Valued Segments (VS). The example also shows how to add custom columns.

    #
    # GTrack example file 2
    #
    # Note: tech is a custom column and not part of the GTrack specification
    #
    ##Track type: valued segments
    ###seqid  tech       start  end   value  strand
    ####genome=hg19
    chr1      ChIP-seq   1047   1165  0.625  -
    chr2      ChIP-chip  2002   2450  .      +
    chr2      ChIP-chip  3033   3246  0.355  +

The third example is more advanced, showing a Step Function dataset, that is, a dataset where every base pair in the domain has an associated value, but where this value is constant, or approximated, over larger regions (250-500 bps). The domain is, in this case, composed of two bounding regions. In addition, some of the regions are linked by edges to other regions in the genome. This example file is thus of type Linked Step Function (LSF).

    #
    # GTrack example file 3
    #
    ##Track type: linked step function
    ##Edge weights: true
    ##Undirected edges: true
    ###id  end   value  edges
    ####seqid=chr1; start=1000; end=2250
    1      1250  10     4=0.4
    2      1500  7      .
    3      2000  2      .
    4      2250  6      1=0.4;6=0.3
    ####seqid=chr1; start=3000; end=4000
    5      3250  7      .
    6      3500  4      4=0.3
    7      4000  6      .

(Note that, for readability, spaces are used instead of tab characters in these example files. They will therefore not work "out of the box". All example files are available as working GTrack files from [3].)

---------------------------
    Basic specification
---------------------------

GTrack is a tabular text file format. All GTrack filenames should end with ".gtrack".
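
As an informal illustration of how the example files above are structured, the following Python sketch classifies each line by its leading "#" characters. It is not part of the specification: the function name classify_line is hypothetical, and the treatment of lines with more than four leading "#" characters is an assumption, as the specification does not define such lines.

```python
def classify_line(line):
    """Return the GTrack line type of one line (hypothetical helper)."""
    stripped = line.rstrip("\r\n")
    if stripped == "":
        # Blank lines are to be ignored by parsers.
        return "blank"
    # Test the longest prefix first, since "####" also starts with "##".
    # Assumption: five or more leading '#' are treated like "####" here.
    prefixes = (
        ("####", "bounding region specification"),
        ("###", "column specification"),
        ("##", "header"),
        ("#", "comment"),
    )
    for prefix, kind in prefixes:
        if stripped.startswith(prefix):
            return kind
    return "data"


# A tab-separated excerpt of example file 2.
example = [
    "# GTrack example file 2",
    "##Track type: valued segments",
    "###seqid\ttech\tstart\tend\tvalue\tstrand",
    "####genome=hg19",
    "chr1\tChIP-seq\t1047\t1165\t0.625\t-",
]

for text in example:
    print(classify_line(text))
```

Checking the longest prefix first keeps the sketch short; a full parser would additionally enforce the required order of the line types.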

The GTrack format consists of five different line types, distinguished by their leading characters and numbered here by order of appearance in the file:

    i.  Comments
    1.  Header lines
    2.  Column specification line
    3a. Bounding region specification line
    3b. Data lines

Note: The number preceding each line type defines the order in which the lines must be present, i.e. the column specification line must follow the header lines, but comments may be present anywhere. Note also that a bounding region specification line must be followed by a data line, but that a file may have multiple bounding region specification lines with data lines in between.

A GTrack validator is available at [3].

-----------
i. Comments
-----------

- Leading characters: #

- Example

    #This is a comment!

- Usage: Optional

Comments are ignored by parsers and may be present anywhere in the file.

---------------
1. Header lines
---------------

- Leading characters: ##

- Format

    ##VARIABLE:[ ]*VALUE

  where
    VARIABLE = Header variable name
    [ ]*     = Optional space characters
    VALUE    = Header variable value

- Example

    ##gtrack version: 1.0
    ##track type: valued points
    ##value type: category
    ##1-indexed: False
    ##end inclusive:True

- Usage

  Optional, but any header variables not declared retain their default values.

- Restrictions

  * GTrack files may add custom header variables, e.g. as part of the definition of a GTrack subtype (see section "Defining GTrack subtypes"). For reserved header variables, however, the values are restricted to the ones allowed by the header variable (see below).
  * All variable names and reserved variable values are treated as case insensitive and do not support character escaping. Custom values, i.e. header values of non-reserved header variables, do, however, support escaping. For more details, see the section "Detailed specification of character usage".

Header lines provide structural information readable by both humans and automatic parsers.
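
To illustrate the header line format given above, here is a minimal Python sketch that splits a header line into a variable name and a value. The function name parse_header_line is hypothetical and not defined by the specification. The sketch lowercases only the variable name, on the assumption that custom header values may be case sensitive, and it does not perform URL unescaping of custom values.

```python
def parse_header_line(line):
    """Split a '##VARIABLE:[ ]*VALUE' line into (name, value).

    Hypothetical helper; it does not check whether the variable is reserved.
    """
    text = line.rstrip("\r\n")
    if not text.startswith("##") or text.startswith("###"):
        raise ValueError("not a header line: %r" % line)
    name, colon, value = text[2:].partition(":")
    if not colon:
        raise ValueError("missing ':' in header line: %r" % line)
    # Variable names are case insensitive, so normalise them to lower case.
    # Optional space characters after the colon are dropped.
    return name.lower(), value.lstrip(" ")


print(parse_header_line("##1-indexed: False"))
print(parse_header_line("##end inclusive:True"))
```

A parser supporting only reserved header variables could compare the returned name against its list of known names and warn about anything else.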
The GTrack format defines a reserved set of header variables, each with a default value. If a header variable is not declared in the header lines, the default value is used. We encourage the use of header lines even when they contain default values, as this adds to the clarity of the file and helps reduce parsing errors. The order of the header lines is unimportant.

Developer notes
---------------
As not all parsers/tools will need to support the full GTrack specification, developers are welcome to support only subsets. We do, however, encourage all GTrack parsers to always check the GTrack header lines and give feedback to the user if a particular feature is unsupported by the parser/tool.

Note that non-reserved header lines should be ignored by parsers, unless they specifically support the particular extensions. We encourage parsers to print warnings for any unsupported, non-reserved header lines, as they may be a result of typing errors.

Note also that, for consistency, the default values will not change in future versions of the GTrack specification.
---------------

Reserved header variables
-------------------------

- GTrack version
  The version of the GTrack specification used for the file.
  Default value: 1.0

- Track type*
  one of:
      points
      valued points
      segments
      valued segments
      genome partition
      step function
      function
      linked points
      linked valued points
      linked segments
      linked valued segments
      linked genome partition
      linked step function
      linked function
      linked base pairs
  Defines the track type of a GTrack file. Each track type defines a set of core columns to be used. See the section "Column specification line" for more details.
  Default value: segments

- Value type
  one of:
      number
      binary
      character
      category
  Only used if the "value" column is defined. Defines the kind of content accepted in the value column. See the section "Column specification line" for more details.
  Default value: number

- Value dimension
  one of:
      scalar
      pair
      vector
      list
  Only used if the "value" column is defined. Defines the dimension of the content accepted in the value column. See the section "Column specification line" for more details.
  Default value: scalar

- Undirected edges*
  Only used if the "edges" column is defined. True if all edges specified in the GTrack file are undirected, else false. Note that undirected edges between two track elements must still be specified in both data lines, using the same weights.
  Default value: false

- Edge weights*
  Only used if the "edges" column is defined. True if weights are specified for edges, else false. If true, all edges must have a weight specification; if false, no edge may specify a weight.
  Default value: false

- Edge weight type
  one of:
      number
      binary
      character
      category
  Only used if the "edges" column is defined and the "Edge weights" header variable is set to "true". Defines the kind of content accepted as edge weights. See the section "Column specification line" for more details.
  Default value: number

- Edge weight dimension
  one of:
      scalar
      pair
      vector
      list
  Only used if the "edges" column is defined and the "Edge weights" header variable is set to "true". Defines the dimension of the content accepted as edge weights. See the section "Column specification line" for more details.
  Default value: scalar

- Uninterrupted data lines*
  True if it is guaranteed that the data lines are not interrupted by bounding region specification lines (as happens when more than one bounding region is specified), comments or blank lines, else false. This is used to help simple parsers.
  Default value: false

- Sorted elements*
  True if it is guaranteed that all bounding regions and track elements come in sorted order. Bounding regions must be sorted first, and the track elements in each bounding region block second. Regions are sorted by the following fields, in ascending order (using only the ones that are defined): genome, seqid, start, end.
  Default value: false

- No overlapping elements*
  Only used for tracks of type Points and Segments, and the variations of these, i.e. Linked and/or Valued Points (VP/LP/LVP) and Linked and/or Valued Segments (VS/LS/LVS). True if it is guaranteed that no two track elements overlap, else false.
  Default value: false

- Circular elements*
  True if any track element or bounding region crosses the coordinate borders of a circular sequence, i.e. that the "end" value is smaller than the "start" value.
  Default value: false

- 1-indexed
  True if the coordinates start at 1, false if the coordinates start at 0.
  Default value: false

- End inclusive
  True if the chromosome coordinate specified in the end column is included in intervals, else false.
  Default value: false

  Developer notes
  ---------------
  We recommend that all parsers always check the values of the header variables '1-indexed' and 'end inclusive', even if only one or some settings are supported by the parser. If the values defined in a GTrack file are unsupported, the parser should fail. This greatly reduces the risk of erroneous positional information.
  ---------------

(Note that the section "Extended specification" includes more reserved header variables.)

* Some header lines include redundant information compared to the rest of the file. These are marked with * in the listing above. The redundant header lines are still explicitly defined for several reasons. First, they allow a human reader to easily find out which features are used in a file. Second, they give simple parsers that only use a subset of the specification a way to check whether they can parse a particular file. Third, they enable automatic validation of whether a file contains the information in the way the author intended. These header lines can be automatically extracted from the rest of a GTrack file by the "Expand GTrack headers" tool, available at [3].

----------------------------
2. Column specification line
----------------------------

- Leading characters: ###

- Format

    ###COL1 COL2 COL3...

  where
    COL1, COL2, COL3 = Column names
    " "              = tab character

- Example

    ###genome seqid start end strand geneId score id edges

  (with tabs instead of spaces)

- Default value

    ###seqid start end

  (with tabs instead of spaces)

- Usage

  Optional. If the line is not defined, the default value is used.

- Restrictions

  * Column names are treated as case insensitive and do not support character escaping. For more details, see the section "Detailed specification of character usage".
  * All column names must be unique.

The column specification line is a tab-separated list of column names. The GTrack specification defines a set of eight reserved column names. Four of these are associated with the four core informational properties: gaps, lengths, values and interconnections. The specific set of core columns present defines the track type (see [1] for more details). The GTrack format also defines four reserved columns that, although they do not define track type, have reserved meanings. The associations between the reserved columns and track types are shown in the following table:

    Column name:                    genome  seqid  start  end  value  strand  id  edges
    Type of column:                 N       N      C      C    C      N       N   C

    Track type:
    Points (P)                      ?       !      X      .    .      ?       ?   .
    Segments (S)                    ?       !      X      X    .      ?       ?   .
    Genome Partition (GP)           ?       !      .      X    .      ?       ?   .
    Valued Points (VP)              ?       !      X      .    X      ?       ?   .
    Valued Segments (VS)            ?       !      X      X    X      ?       ?   .
    Step Function (SF)              ?       !      .      X    X      ?       ?   .
    Function (F)                    ?       !      .      .    X      ?       ?   .
    Linked Points (LP)              ?       !      X      .    .      ?       X   X
    Linked Segments (LS)            ?       !      X      X    .      ?       X   X
    Linked Genome Partition (LGP)   ?       !      .      X    .      ?       X   X
    Linked Valued Points (LVP)      ?       !      X      .    X      ?       X   X
    Linked Valued Segments (LVS)    ?       !      X      X    X      ?       X   X
    Linked Step Function (LSF)      ?       !      .      X    X      ?       X   X
    Linked Function (LF)            ?       !      .      .    X      ?       X   X
    Linked Base Pairs (LBP)         ?       !      .      .    .      ?       X   X

    C - Core reserved column (defines track type)
    N - Non-core reserved column (reserved, but does not define track type)

    X - Column is mandatory
    ? - Column is optional
    . - Column is not allowed
    ! - Property must be present, either as a column or in a bounding region specification (see below)

Table 1: Overview of the eight reserved columns in the GTrack format and their associations to track type.

Reserved columns
----------------

- genome
  The genome assembly of the track element (e.g. hg19, mm9). The GTrack format has no explicit requirements on the syntax or semantics of the genome specification; the interpretation is up to the particular parsers/tools. Elements from different genomes are allowed in the same GTrack file.

  Specifying the genome of a track element is optional. The genome may be specified either as a separate column in the data lines, or in a preceding bounding region specification line (see below), or both. If genome is specified both in a bounding region specification and as a column, the values must be equal.

- seqid
  A sequence identifier, i.e. an identifier of the underlying sequence of the particular track element. Usually defined as chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671), as defined in the genome assembly. As with the "genome" column, the GTrack format has no explicit requirements on the syntax or semantics of the "seqid" column; the interpretation is up to the particular parsers/tools. Some parsers may for instance allow chromosome arms (e.g. chr1p) as seqid.

  All track elements in a GTrack file must have a seqid, either as a separate column in the data lines, or in a preceding bounding region specification line (see below), or both. If seqid is specified both in a bounding region specification and as a column, the values must be equal.

- start
  The start position of the track element, using the indexing system defined in the header (0- or 1-based).

  Developer notes
  ---------------
  The start column is not defined for some track types (as described in Table 1).
  In order to still work on the start position of an element, it has to be inferred from other information in the following manner, according to track type:

  Genome Partition (GP), Step Function (SF), Linked Genome Partition (LGP) and Linked Step Function (LSF):
      The start position of each track element can be seen as the position immediately following the end of the track element of the previous line. The exact value of the start position depends on the "End inclusive" header variable: if the coordinates are end-exclusive, the start position of a track element is exactly the end position of the previous line; if they are end-inclusive, the start position is the previous end position + 1. For the first line in a set of data lines, the start position should be set to the start position of the preceding bounding region (see section "Bounding region specification line").

  Function (F), Linked Function (LF) and Linked Base Pairs (LBP):
      Each line defines a successive location along the genome. The start of the first line in a set of data lines is then the start position of the preceding bounding region. The start value is then increased by 1 for each line.
  ---------------

- end
  The end position of the track element, using the indexing system (0- or 1-based) and "end inclusive" property as defined in the header.

  Developer notes
  ---------------
  The end column is not defined for some track types (Points (P), Valued Points (VP), Function (F), Linked Points (LP), Linked Valued Points (LVP), Linked Function (LF) and Linked Base Pairs (LBP), as described in Table 1). In order to still work on the end position of an element, it has to be inferred from the start position. In these cases, the end position depends on the "End inclusive" header variable. If true, the end position is the same as the start position; if false, the end position is the start position + 1.
  ---------------

- strand
  The strand of the track element. "+" for positive, "-" for negative strand, and "." when strand information is missing or irrelevant.

- value
  The value or score of the track element. The character "." denotes that the track element has a missing value. The basic type of the contents follows the "Value type" header variable as follows:

  number
      One floating point number, e.g. -1.23, 12 or 3.1e-4. English decimal notation is used, including scientific e notation, with the period character representing the decimal separator, but with no spacing. Note that integer numbers are a subset of floating point numbers, and should use "number" as the value type.

  binary
      One binary value. If this value is used to denote case and control, the following notation must be used: 1 for case, 0 for control.

  character
      One ASCII character, e.g. A, T, C. See the section "Detailed specification of character usage" for restrictions.

  category
      A string defining a category. The set of all category values over all track elements forms a category set, e.g. {gene, exon, promoter}. See the section "Detailed specification of character usage" for restrictions.

  In addition, the "value dimension" header variable may define that the value contains more than one instance of the basic value type, as follows:

  list
      A list of values, following the basic type defined in the "Value type" header variable. Lists of numbers and categories are delimited by comma, e.g. 1.23,2.34,3.45,4,5 or exon,gene,CDS,gene. Lists of binary values and characters use no delimiter, e.g. 1011011010 or ATGCTCGACG. Lists that combine different basic types are not allowed. The length of lists may vary between track elements. The missing element character, ".", is allowed in lists.

  vector
      A vector of values, similarly defined as a list, with the only difference that vectors must have the same length throughout the GTrack file.

  pair
      A pair of values, similarly defined as a vector, with the only limitation that the length is exactly 2.

  scalar
      A single value, following the basic type defined in the "Value type" header variable, e.g. 1.23, 0, g or exon, respectively.

  Developer notes
  ---------------
  Note that the different dimensions are defined in a hierarchical manner: lists > vectors > pairs & scalars. All scalars or pairs are also vectors of length 1 or 2, respectively, and all vectors are lists. Support for lists in a parser should then also lead to the support of its "sub-dimensions", given, of course, that the analysis allows that they are treated in an equal fashion.
  ---------------

- id
  A unique string identifying each track element (data line). Can be in any format, e.g. 1, aab or uc002ico.1. See the section "Detailed specification of character usage" for restrictions.

- edges
  A semicolon-separated list of ids, representing edges from the track element in the current line to the track elements which the ids identify. A "." character denotes that the track element has no edges. An edge is by default directed.

  If the header variable "Edge weights" is set to "true", each edge must have a weight value directly following, after an equals sign. The format of the weight value follows the "Edge weight type" and the "Edge weight dimension" header variables in the same way as the "value" format follows the "Value type" and "Value dimension" header variables (see above). Note that no space characters are allowed after the semicolon.

  Example:

    ###seqid  start  end  id   edges
    chr1      0      100  aaa  aab=1.2;aac=.
    chr1      200    350  aab  aaa=1.1
    chr1      450    500  aac  .

  Here, the aaa node is connected to the aab node with two directed edges, with the edge from aaa to aab having higher weight than the one in the other direction.

  Note that undirected edges must still be specified in both directions, using the same weights. This adds redundancy, but simplifies parsing. If all edges in a GTrack file are undirected, the header variable "Undirected edges" should be set to "true".

--------------------------------------
3a. Bounding region specification line
--------------------------------------

- Leading characters: ####

- Format

    Type A) ####genome=VAL1
    or
    Type B) ####[genome=VAL1;[ ]*]seqid=VAL2[;[ ]*start=VAL3][;[ ]*end=VAL4]

  where
    [x] = 'x' is optional
    [ ]* means optional space characters
    genome, seqid, start, end = reserved attribute names
    VAL1, VAL2, VAL3, VAL4    = attribute values

- Example

    ####genome=hg18; seqid=chr1;start=100; end=10000

- Usage

  Type B is mandatory for GTrack files of one of the following track types:

      Genome Partition (GP)
      Step Function (SF)
      Function (F)
      Linked Genome Partition (LGP)
      Linked Step Function (LSF)
      Linked Function (LF)
      Linked Base Pairs (LBP)

  For all other track types, bounding region specification lines are optional.

- Restrictions

  * Attribute names are treated as case insensitive and do not support character escaping. Genome and seqid values do, however, support escaping. For more details, see the section "Detailed specification of character usage".
  * A bounding region specification remains in effect for a set of data lines until the next bounding region specification.
  * If a GTrack file contains any bounding regions, then all elements must be enclosed by one.
  * Bounding regions are not allowed to overlap.
  * Bounding regions of type A and B are not allowed in the same GTrack file.
  * No data lines following a bounding region of type B may have start or end positions defined outside the bounding region.
  * For track types Genome Partition (GP), Step Function (SF), Linked Genome Partition (LGP) and Linked Step Function (LSF), the "end" attribute must be equal to the end position of the last track element of the block of data lines immediately following the bounding region specification line.

    Example:

      ##track type: genome partition
      ###end
      ####seqid=chr1; start=100; end=200
      125
      133
      200

  * For track types Function (F), Linked Function (LF) and Linked Base Pairs (LBP), the "end" attribute must be exactly equal to the "start" attribute plus the number of data lines immediately following the bounding region specification line. If the header line "End inclusive" is true, the end position should be 1 less.

    Example:

      ##track type: function
      ###value
      ####seqid=chr1; start=100; end=103
      1.2
      -0.1
      0.8

A bounding region specifies a genomic interval encompassing the data lines that follow. A bounding region should be thought of as constituting the domain of the following track elements, i.e. the region where we have information about the properties modeled by the track elements. The set of all bounding regions of a track then constitutes the domain of the track.

Note that, in the case of Points and Segments (and the variations of these, i.e. Linked and/or Valued Points (VP/LP/LVP) and Linked and/or Valued Segments (VS/LS/LVS), see Table 1), lack of elements is also considered information. A bounding region is then, in this case, a region where we know that the lack of data means something. Areas of the genome that have not been investigated (such as centromeres) should be left outside the bounding regions.

For track types other than Points and Segments (and their variations), the track elements do by definition fill the entire domain. For example, a Function has, by definition, a value for all base pairs in the domain. A bounding region is then just the smallest region encompassing the track elements that follow. For more details, see [1].

The bounding region specification comes in two flavors:

A) The bounding region specifies the genome assembly for the following track elements, using the same format as for the "genome" column (see the "Column specification line" section). The domain of the track is then the set of sequences constituting the genome, e.g. all chromosomes of the genome. If a track contains several genomes, the domain of the track is the collected set of sequences constituting all the specified genomes.

B) The bounding region specifies a single sequence, or part of this sequence, as the domain of the following track elements. The format is a set of attribute pairs separated by semicolon and optional space characters. For each attribute pair, the attribute name and the value are separated by the equals sign. The attributes may appear in any order. The allowed attributes are the following:

- genome
  The genome assembly of the bounding region (e.g. hg19, mm9). The format of the genome attribute is the same as for the "genome" column (see the section "Column specification line"). The "genome" attribute is optional.

- seqid
  A sequence id, i.e. the id of the underlying sequence of the bounding region. The format of the seqid attribute is the same as for the "seqid" column (see the section "Column specification line"). The "seqid" attribute is mandatory for a bounding region specification line of type B. Note that if type B bounding region specifications are not defined, the "seqid" column must be included in the column specification line.

- start
  The start position of the bounding region, using the indexing system defined in the header (0- or 1-based). The "start" attribute is optional.

  Developer notes
  ---------------
  If the "start" attribute is not specified, the start position of the bounding region is 0 (or 1, if the header variable "1-indexed" is true).
  ---------------

- end
  The end position of the bounding region, using the indexing system (0- or 1-based) and "end inclusive" property as defined in the header. The "end" attribute is optional.

  Developer notes
  ---------------
  If the "end" attribute is not specified, the end position of the bounding region is the same as the end position of the sequence referenced by the 'seqid' attribute, e.g. the length of the current chromosome.
  If the parser does not have information about the length of the sequence in question, the user should be informed, or, in the case that the bounding region is unimportant for the parser, the bounding region specification should be ignored.

  Note that the restrictions regarding the "end" attribute for certain track types (see section "Restrictions" above) must still hold, even if the "end" attribute is not explicitly specified.
  ---------------

--------------
3b. Data lines
--------------

- Leading characters:

- Format

    VAL1 VAL2 VAL3...

  where
    VAL1, VAL2, VAL3 = column values
    " "              = tab character

- Example

    chr21 304 997 - FOOGENE 423 1 .

  (with tabs instead of spaces)

- Usage

  Data lines are optional.

- Restrictions

  * Column values support character escaping, as specified in the section "Detailed specification of character usage".
  * The number of columns of each data line must be equal to the number of columns in the column specification line.
  * For track types Genome Partition (GP), Step Function (SF), Linked Genome Partition (LGP), and Linked Step Function (LSF), the data lines in each bounding region block must be sorted on the "end" value, in ascending order.

Each data line is a tab-separated list of values, as defined by the column specification line. If there is a missing value in either of the "value" and "edges" columns, the period character, ".", may be used. See the section "Column specification line" for more details.

-----------------
BED compatibility
-----------------

Note that a simple BED file only using the three columns chr, start and end is directly compatible with the GTrack format. This is because the default track type of a GTrack file is Segments (S), which defines the same three core columns as a simple BED file (see Table 1). One thus only needs to change the file ending of such a file from ".bed" to ".gtrack" before running it through a GTrack parser. If a UCSC custom track definition line or other headers are present, they must be commented out.
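
The commenting-out step can be sketched in Python as follows. This is an informal illustration only: the function name comment_out_bed_headers is hypothetical, and the sketch assumes that the headers in question are UCSC lines whose first word is "track" or "browser"; other kinds of header lines would need separate handling.

```python
def comment_out_bed_headers(lines):
    """Prefix UCSC 'track' and 'browser' lines with '#'.

    Hypothetical helper. All other lines, including the three-column
    BED data lines, are returned unchanged.
    """
    result = []
    for line in lines:
        words = line.split(None, 1)
        # Assumption: a header is recognised by its first word only.
        if words and words[0] in ("track", "browser"):
            result.append("#" + line)
        else:
            result.append(line)
    return result


bed_lines = [
    "track name=demo\n",
    "chr1\t121\t201\n",
    "chr2\t486\t1240\n",
]

for converted in comment_out_bed_headers(bed_lines):
    print(converted, end="")
```

The result can then be written to a file with the ".gtrack" ending.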
More complex BED files must be converted. Converters to common file formats are available at [3].

-----------
Compression
-----------

As genomic tracks may contain large amounts of data, we require that fully compliant GTrack parsers support the expansion of tabular files compressed with the gzip compression algorithm [4]. Such GTrack files should have the suffix ".gtrack.gz".

-----------------------------------------
Detailed specification of character usage
-----------------------------------------

- The GTrack format supports escaping of special characters using URL escaping conventions (%XX hex codes). All ASCII characters are supported, except the following, which must be escaped everywhere:

    Most control characters (except TAB, LF, CR): %00-%08, %0B-%0C, %0E-%1F, %7F
    Extended ASCII characters: %80 through %FF

  Also, the following characters have reserved meaning, and must be escaped when used with other meanings in places where they may interfere with the parsing:

    tab (TAB):             %09
    newline (LF):          %0A
    carriage return (CR):  %0D
    space:                 %20
    # (hash):              %23
    % (percent):           %25
    , (comma):             %2C
    ; (semicolon):         %3B
    = (equals):            %3D
    . (period):            %2E

  Note that spaces need not be escaped in data lines, as the data values are separated by tabs.

- Reserved phrases in a GTrack file receive special treatment. Reserved phrases include all header variable names, reserved header variable values (excluding custom header variable values), column names (including custom columns) and bounding region attribute names. Reserved phrases should be treated as case insensitive and do not support URL escaping.

- One must in all cases avoid starting or ending a value with unescaped whitespace.

- A line must end with the newline character (LF), optionally preceded by a carriage return (CR).

- Blank lines should be ignored by parsers.

- Comments, header lines, column specification lines and bounding region specification lines are characterized by the number of leading #-characters.
  Note that, except for comments, once the file reaches a certain "level" of
  #-characters, this count never goes down. Thus, header lines, column
  specification and bounding region specifications are always found in that
  order.

- Note that delimiter characters differ for the various lines/columns. See the
  specification above for details. Also note that examples in this file use
  spaces instead of tabs for readability. These examples should not be
  directly copied into GTrack files.

------------------------------
Extended specification
------------------------------

The extended part of the GTrack specification consists of the following header
variables:

    - Value column
    - Edges column
    - Fixed length
    - Fixed gap size
    - Fixed-size data lines
    - Data line size
    - GTrack subtype
    - Subtype version
    - Subtype URL
    - Subtype adherence

These header variables are redundant compared to the basic GTrack
specification, that is, they do not allow any extra types of information to be
represented. They do, however, allow existing information to be represented in
more practical ways, in addition to supporting standardized ways of extending
the GTrack format by defining GTrack subtypes.

-----------------------
Redefining column names
-----------------------

A GTrack file may contain several columns that could be used as the 'value'
column, and similarly for the 'edges' column. To change which columns are
used, one must, as described in the basic GTrack specification, modify the
column specification line. The following header variables may, however,
simplify the process.

- Value column

    The name of the column to be used as the 'value' column.

    Default: value

- Edges column

    The name of the column to be used as the 'edges' column.

    Default: edges

Note that if either of these header variables has a non-default value, the
corresponding default value ('value' or 'edges') must not be included in the
column specification line.
The following example is thus an incorrect GTrack file:

    ##track type: valued segments
    ##value column: score
    ###seqid  start  end  value  score
    chr1      0      50   1.0    0.9
    chr1      100    125  1.1    0.8

The following file does, however, follow the GTrack specification:

    #
    # GTrack example file 4
    #
    ##track type: valued segments
    ##value column: score2
    ###seqid  start  end  score1  score2
    chr1      0      50   1.0     0.9
    chr1      100    125  1.1     0.8

Developer notes
---------------
The 'value column' and the 'edges column' header variables should be
interpreted prior to parsing the column specification line. The column name
referred to by the variable(s) should be renamed to 'value' or 'edges',
respectively. If two columns in this way end up with the same name, the parser
should return an error. In this way, a parser that does not support the 'value
column' and 'edges column' header variables will issue an error when a
properly specified GTrack file with such headers is parsed, as, in that case,
the track type will not match the column specification line according to
Table 1. Parsing errors are recommended over incorrect analysis results caused
by erroneous interpretation of columns.
---------------

-----------------
WIG compatibility
-----------------

The WIG format [6] includes the parameters 'step' and 'span', specifying a
fixed step size, i.e. the distance between start positions, and a fixed span
size, i.e. the length of track elements, respectively.
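To make the meaning of 'step' and 'span' concrete, the following Python sketch
expands a fixedStep block into explicit element positions. The function name
expand_fixed_step is hypothetical, and the sketch assumes 1-indexed,
end-inclusive coordinates; it is an illustration, not a WIG parser.

```python
def expand_fixed_step(start, step, span, count):
    """Return (start, end) pairs for `count` elements of a fixedStep block.

    Assumption: positions are 1-indexed and the end position is inclusive,
    so an element starting at `s` with length `span` ends at `s + span - 1`.
    """
    if step < 1 or span < 1 or count < 0:
        raise ValueError("step and span must be >= 1, count must be >= 0")
    elements = []
    for i in range(count):
        element_start = start + i * step
        elements.append((element_start, element_start + span - 1))
    return elements


if __name__ == "__main__":
    # Two values in a block declared with start=201 step=100 span=50.
    print(expand_fixed_step(201, 100, 50, 2))
```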
Consider for instance the following WIG file:

    fixedStep chrom=chr1 start=201 step=100 span=50
    25.0
    26.0
    fixedStep chrom=chr2 start=151 step=100 span=50
    10.0
    11.0

A GTrack version of this file, using the basic specification, would look
something like this, using three columns instead of one:

    #
    # GTrack example file 5A
    #
    ##Track type: valued segments
    ##1-indexed: true
    ##End inclusive: true
    ###start  end  value
    ####seqid=chr1
    201       250  25.0
    301       350  26.0
    ####seqid=chr2
    151       200  10.0
    251       300  11.0

In order to support WIG-like functionality in GTrack, the following header
variables may be used:

- Fixed length

    Only used when the end column is not specified. Defines a fixed length for
    all elements in the GTrack file.

    Restrictions:
      * fixed length >= 1

    Track type dependency: When fixed length > 1, the track type should be
    determined as though the end column is present (see Table 1).

    Default: 1

    Developer notes
    ---------------
    Contrary to the restrictions of bounding regions of type B (see above),
    the end position of the segments in a bounding region is allowed to cross
    the region border, if implicitly defined by the "fixed length" header
    variable. Depending on the application, the parser must decide whether to
    crop the length of the elements, i.e. set the end position of any elements
    crossing the region border (typically the last element) equal to the end
    position of the surrounding bounding region.
    ---------------

- Fixed gap size

    Only used when neither the start nor the end column is specified. Defines
    fixed-size gaps between all neighboring elements in the same bounding
    region. Gap size is defined as the number of uncovered base pairs between
    the elements.

    The following equation defines the relation between length, gap size and
    start positions:

        start_(n+1) = start_n + fixed length + fixed gap size

    where 'start_(n+1)' is the start position of a track element immediately
    following an element with start position 'start_n' in the same bounding
    region.
    Restrictions:
      * fixed length + fixed gap size > 0
      * Only allowed in GTrack files using bounding regions of type B (see
        section "Bounding region specification line").

    Track type dependency: When fixed gap size != 0, the track type should be
    determined as though the start column is present (see Table 1).

    Default: 0

To convert from a WIG file to a GTrack file, one may use the following
formulas:

    fixed length = span
    fixed gap size = step - span

The WIG file shown above may then be represented in the following way as a
GTrack file:

    #
    # GTrack example file 5B
    #
    ##Track type: valued segments
    ##1-indexed: true
    ##End inclusive: true
    ##Fixed length: 50
    ##Fixed gap size: 50
    ###value
    ####seqid=chr1; start=201
    25.0
    26.0
    ####seqid=chr2; start=151
    10.0
    11.0

Note that the definitions above allow negative values for the variable 'fixed
gap size'. Such values may be used to represent sliding windows, i.e. segments
that overlap with a fixed number of base pairs.

-------------------
FASTA compatibility
-------------------

The following header variables may be used to represent FASTA-like sequences
[5], and other simple function tracks, such as GC content, in a condensed
manner. Consider a GTrack file of type 'Function', with only the value column
specified:

    #
    # GTrack example file 6A
    #
    ##Track type: function
    ##Value type: character
    ###value
    ####seqid=seq001
    A
    G
    C
    ####seqid=seq002
    G
    G

This is a valid GTrack file according to the basic specification. However,
reading a sequence using only one nucleotide per line is quite impractical.
The following header variables change the interpretation of the data lines:

- Fixed-size data lines

    True if each data line has an exact size in terms of number of characters.
    This is only allowed for track type Function (F), and only if the only
    column specified is "value".
    Newline and carriage return characters are ignored when parsing, and the
    data lines are separated using the number of characters specified in the
    header variable "Data line size" (below).

    Developer notes
    ---------------
    Note that parsers still need to be able to recognize bounding region
    specification lines.
    ---------------

    Default: false

- Data line size

    The size of each data line in terms of number of characters. Only used if
    the header variable "Fixed-size data lines" (above) is true.

    Default: 1

Using these header variables, the example GTrack file shown above can be
expressed in the following way:

    #
    # GTrack example file 6B
    #
    ##Track type: function
    ##Value type: character
    ##Fixed-size data lines: true
    ##Data line size: 1
    ###value
    ####seqid=seq001
    AGC
    ####seqid=seq002
    GG

------------------------
Defining GTrack subtypes
------------------------

The GTrack format includes support for defining GTrack subtypes, that is, file
formats that adhere to only a subset of the GTrack specification. This allows
implementation of more specialized parsers, while at the same time ensuring
that subtype GTrack files still work with fully compliant GTrack parsers.
GTrack subtypes may also be used to standardize special GTrack configurations,
removing the need for the individual GTrack files to include all the required
meta information. We encourage independent specification of subtypes catering
to specialized needs.

A GTrack subtype defines default values for header variables and/or the column
specification line. A subtype may also add new header variables or define how
parsers should interpret the values of any non-reserved columns. GTrack
subtypes must still conform to the GTrack specification. Interpretation of new
columns or header lines does of course require specialized parsers.
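The idea that a subtype supplies defaults, which the individual file may then
override, can be sketched in Python as follows. This is a minimal
illustration under stated assumptions: the names GTRACK_DEFAULTS,
parse_header_line and resolve_headers are hypothetical, the defaults table is
deliberately partial, and the sketch does not enforce any restrictions a
subtype may place on overriding (the "Subtype adherence" header variable).

```python
# Partial table of GTrack defaults (assumption: a real parser would list
# every header variable defined by the specification).
GTRACK_DEFAULTS = {
    "track type": "segments",
    "fixed length": "1",
    "fixed gap size": "0",
    "fixed-size data lines": "false",
    "data line size": "1",
    "subtype adherence": "free",
}


def parse_header_line(line):
    """Split '##Name: value' into a lower-cased name and a stripped value."""
    if not line.startswith("##") or line.startswith("###"):
        raise ValueError("not a header line: %r" % line)
    # Only the first colon separates name and value, so URLs stay intact.
    name, separator, value = line[2:].partition(":")
    if not separator:
        raise ValueError("header line lacks a colon: %r" % line)
    # Header variable names are reserved phrases and are case insensitive.
    return name.strip().lower(), value.strip()


def resolve_headers(subtype_header_lines, file_header_lines):
    """Layer headers: GTrack defaults, then subtype values, then file values."""
    headers = dict(GTRACK_DEFAULTS)
    for header_lines in (subtype_header_lines, file_header_lines):
        for line in header_lines:
            name, value = parse_header_line(line)
            headers[name] = value
    return headers


if __name__ == "__main__":
    subtype = ["##Track type: function", "##Fixed-size data lines: true"]
    gtrack_file = ["##Subtype URL: http://gtrack.no/fasta.gtrack"]
    for name, value in sorted(resolve_headers(subtype, gtrack_file).items()):
        print("%s = %s" % (name, value))
```

Later layers simply overwrite earlier ones here; a parser that honours a
restrictive adherence setting would instead have to reject some overrides.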
Example #1: FASTA
-----------------

As an example of the use of subtypes, we show how GTrack can be used in a
manner similar to conventional FASTA files [5] (see the section "FASTA
compatibility" above). Example file 7A is the subtype specification file:

    #
    # GTrack example file 7A
    #
    # Specification of FASTA subtype for GTrack.
    # Available at http://gtrack.no/fasta.gtrack
    #
    ##GTrack version: 1.0
    ##GTrack subtype: FASTA
    ##Subtype version: 1.0
    ##Subtype adherence: strict
    ##Track type: function
    ##Value type: character
    ##Fixed-size data lines: true
    ##Data line size: 1
    ###value

When using the subtype, an "online" parser will download the subtype
specification file (above) and use the specified header values and/or column
specification line instead of the GTrack default values. The header of a
GTrack file adhering to the subtype may then be as simple as including the URL
of the subtype specification, as in example file 7B:

    #
    # GTrack example file 7B
    #
    # This file makes use of the FASTA subtype specification.
    #
    ##Subtype URL: http://gtrack.no/fasta.gtrack
    ####seqid=seq0001
    TAGACATTACCGCTAGGATGATGCGATCGATCGATCCCTCTGGATTAGGAGATCTCTAGATCGATGATATCCTCNN
    NNNNNNNATTGCTCTAGCTCTAGCTCTAGCT
    ####seqid=seq0002
    GATTACATATCGCGATCGACTCGCCACTATAACTTCGAGTCTGACGATGATGGGGGGG

GTrack subtype header lines
---------------------------

Subtype functionality is applied with the following header variables:

- GTrack subtype

    The name of the subtype of the GTrack format used for the file, if any.
    May be specified if a GTrack file conforms to a subtype, even if the
    header variable "Subtype URL" is not specified.

    Developer notes
    ---------------
    Custom parsers that only support certain subtypes should check this header
    and give feedback to users if the subtype is not correct.
    ---------------

    Default value: ""

- Subtype version

    The version of the GTrack subtype. May be specified if a GTrack file
    conforms to a subtype, even if the header variable "Subtype URL" is not
    specified.
    Default value: 1.0

- Subtype URL

    URL to a GTrack file used as a specification/model for the GTrack subtype,
    if any. The subtype GTrack specification file is a normal GTrack file, but
    without bounding region specification lines or data lines. The header
    lines and/or the column specification line of a GTrack subtype model file
    are used instead of the default values for other GTrack files that adhere
    to the subtype. Any other specifications/restrictions should be included
    as comments. The "subtype URL" header variable is not allowed in GTrack
    subtype specification files.

    Developer notes
    ---------------
    If a GTrack file contains a Subtype URL header line, the subtype
    specification file should be downloaded by the parser. Incomplete URLs
    without a specified scheme (e.g. "gtrack.no") should be treated as
    HTTP-addresses (e.g. "http://gtrack.no"). Any inconsistencies between
    header lines of the GTrack files and the subtype headers should be treated
    according to the "subtype adherence" header variable (see below).

    If the header variables "GTrack subtype" or "Subtype version" (see above)
    in a GTrack file do not correspond to the same header variables in the
    subtype specification file, the user should be informed. It is then up to
    the parser to decide whether or not to continue parsing.

    If subtype specification downloading is not supported by the parser and a
    subtype URL is provided in the GTrack file, the user should be informed
    that the "Expand GTrack headers" tool available at [3] may be used to
    merge the subtype headers with the GTrack file for use in "offline"
    parsers.
    ---------------

    Default value: ""

- Subtype adherence

    Subtype adherence may be specified in the subtype GTrack specification
    file and will then regulate the way a GTrack file may override the subtype
    specification. The subtype adherence may also be specified in a GTrack
    file, and will in this case function as a signal to parsers.
    In this way, different parsers may allow different levels of adherence for
    GTrack files of the same subtype. The following values are allowed:

    strict
        Values of header variables and the column specification line, as
        defined by the subtype, may not be overridden by the contents of a
        file. GTrack defaults may be overridden.

        This option may be used to force users of a subtype to follow the
        specification exactly.

    extensible
        As strict, but allows redefinition of the column specification line in
        one aspect:
          * any number of extra columns, including non-core reserved columns,
            may be added to the end of the column specification line. Adding
            core reserved columns is not allowed.

        This option may be used to allow users of a subtype to add their own
        content, while maintaining the exact interpretation of the first
        columns as defined by the subtype.

    redefinable
        As extensible, but allows redefinition of the column specification
        line in another aspect:
          * the "value" and "edges" columns may be redefined, i.e. any
            non-core column names may be renamed to "value" or "edges", and
            vice-versa, or the "value" and/or "edges" column may be added to
            the end of the column specification line.
          * correspondingly, the header lines "Track type", "Value type",
            "Value dimension", "Undirected edges", "Edge weights", "Edge
            weight type", "Edge weight dimension", "Value column" and "Edges
            column" may also be redefined by the GTrack file.

        This option may be used to allow users of a subtype to add their own
        content, including redefining the "value" and "edges" columns, while
        maintaining exactly the same content in the first columns as defined
        by the subtype.

    reorderable
        As strict, but allows redefinition of the column specification line in
        the following manner:
          * all columns specified in the subtype specification must be
            included, but can be put in any order, and any extra columns may
            be added.
          * correspondingly, the header line "Track type" may also be
            redefined by the GTrack file.
        Note that in this case, redefinition of the "value" or "edges"
        columns, as permitted under "redefinable", is not allowed, but a
        "value" or an "edges" column may be added, if not present. This
        restriction guarantees consistent identification of columns by column
        name.

        This option may be used to allow users of a subtype to adopt their own
        column ordering, while at the same time maintaining that a minimum of
        columns must be present, identifiable by column name.

    free
        Everything is allowed, as long as the GTrack specification is
        followed. This option leads to the subtype specification being used
        for no more than an alternative definition of default values of the
        GTrack header lines and column specification line.

    Developer notes
    ---------------
    Note that if subtype adherence is specified in the subtype specification
    as anything other than "free", a GTrack file using the subtype
    specification may not redefine this value.
    ---------------

    Default value: free

Example #2: Short reads
-----------------------

As an extra example of the subtype functionality, we here propose a format for
storing short reads (e.g. from ChIP-seq experiments). Again, example file 8A
is the GTrack subtype specification file, and example file 8B is a GTrack file
making use of the subtype:

    #
    # GTrack example file 8A
    #
    # Specification of Short reads example subtype.
    # Available at http://gtrack.no/shortreads_example.gtrack
    #
    ##GTrack version: 1.0
    ##GTrack subtype: Short reads example
    ##Subtype version: 0.9
    ##Subtype adherence: redefinable
    ##Track type: segments
    ###seqid start end strand read quality
    #
    # Unmapped reads may be stored in comment lines at the end of the file, as
    # exemplified below.
    #
    # Unmapped reads:
    #
    # AGATAGATAGGATCCCAGCTGACT
    # AGTCCTCTAGCTCTGACTATC

    ---

    #
    # GTrack example file 8B
    #
    # GTrack file making use of the Short reads example subtype.
    #
    ##Track type: valued segments
    ##Subtype URL: http://gtrack.no/shortreads_example.gtrack
    ###seqid  start  end  strand  read        value  new
    chr1      101    111  +       AGTAGATAGC  0.8    0
    chr1      203    244  -       0:C;15:G    0.7    1
    #
    # Unmapped reads:
    #
    # ATGAATATTAAAAATCTCCT
    # AGCGACCATACGTACATTACGAC

The "Short reads example" subtype defines two extra columns, named "read" and
"quality". A read is then either the exact read (using nucleotide symbols with
the exact same length as the track element) or a semicolon-separated list of
colon-separated mismatches, where a mismatch is represented by a relative
position and a nucleotide symbol. The reference is here the genome assembly
specified in the description lines. The relative positions should follow the
indexing defined by the "1-indexed" header variable. The column quality
contains the quality score of the read.

According to the "redefinable" subtype adherence setting, adding columns to
the end is allowed. In example file 8B, the "new" column is added. Also note
that the "redefinable" setting allows the redefinition of any column as a
"value" column, here the "quality" column.

A set of basic GTrack subtypes is available from [3].

------------------
References
------------------

[1] Gundersen S, Kalas M, Abul O, Frigessi A, Hovig E, Sandve GK: Identifying
    elemental genomic track types and representing them uniformly. In press.
[2] http://genome.ucsc.edu/FAQ/FAQformat.html
[3] http://www.gtrack.no
[4] http://www.gzip.org
[5] http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
[6] http://genome.ucsc.edu/goldenPath/help/wiggle.html

------------------
Change log
------------------

v1.0 - 2011.12.23:
    * First public version, included as "Additional file 1" in [1].