|-----------------------------------------|
| Specification of the GTrack file format |
|-----------------------------------------|

GTrack version: 1.0
Document version: 1.0
Date: 23 Dec 2011
Authors: Sveinung Gundersen, Matus Kalas, Osman Abul, Arnoldo Frigessi,
         Eivind Hovig, Geir Kjetil Sandve


----------------
    Contents
----------------

* Reading the specification
* What is GTrack?
* Example GTrack files
* Basic specification
    i.  Comments
    1.  Header lines
    2.  Column specification line
    3a. Bounding region specification line
    3b. Data lines
    -   BED compatibility
    -   Compression
    -   Detailed specification of character usage
* Extended specification
    -   Redefining column names
    -   WIG compatibility
    -   FASTA compatibility
    -   Defining GTrack subtypes
* References
* Change log


---------------------------------
    Reading the specification
---------------------------------

This document contains the complete specification of the GTrack format. As the
document contains many details, we here present some reading recommendations:

- Skip the "Developer notes" sections if you are not planning to develop parsers
  of the GTrack format.

- The "Restrictions" section after each main type of GTrack lines contains
  detailed descriptions that can be skipped by most readers.

- The section "Detailed specification of character usage" contains very detailed
  information and can be skipped by most readers.

- The sections under "Extended specification" describe extensions that do not
  add any new types of information to a GTrack file, only alternative ways of
  expressing the same information, in addition to functionality for defining
  GTrack subtypes. These sections are thus not required for basic use.
  
An HTML version of this specification is available at [3]. In the HTML version,
the sections described above are hidden by default.


-----------------------
    What is GTrack?
-----------------------

GTrack is short for both "Genomic Track" and "Generic Track". GTrack is a
general purpose, tabular file format for representing data in the form of
genomic tracks, that is, as elements associated with positions along a
reference (genome) sequence, or a set of sequences.

GTrack emphasizes preciseness, flexibility, and simple parsing. This is achieved
by allowing flexible column specification and declaring syntactic properties at
the beginning of the file (allowing parsers to cleanly restrict support to a
subset of the GTrack specification).

A main contribution of the format is the unified and optimized formalization of
sequence level genomic data into one of fifteen track types, as developed in
[1]:

Points (P)
Valued Points (VP)
Segments (S)
Valued Segments (VS)
Genome Partition (GP)
Step Function (SF)
Function (F)
Linked Points (LP)
Linked Valued Points (LVP)
Linked Segments (LS)
Linked Valued Segments (LVS)
Linked Genome Partition (LGP)
Linked Step Function (LSF)
Linked Function (LF)
Linked Base Pairs (LBP)

These fifteen track types encompass most of the existing file formats, while
providing support for, among other things, genomic data of a three-dimensional
nature. The primary goals of the GTrack format are to support all track types
systematically, simplify parsing and manipulation, allow custom extensions, and
provide efficient storage.


----------------------------
    Example GTrack files
----------------------------

Before delving into the details, it is recommended that you examine these
examples of simple GTrack files. You may return to them while reading the rest
of the specification, if needed. The first example is the simplest version of
GTrack, without any specification lines. It shows a data set of a couple of
genomic segments, and the track type is simply Segments (S).


#
# GTrack example file 1
#
# A GTrack file without headers is handled as three-column BED [2]
#
chr1  121  201
chr2  486 1240


The second example contains all GTrack specification lines (header line, column
specification line and bounding region specification line) and shows a dataset
of genomic segments with additional associated information in extra columns. One
of these is selected as the main "value" of the segments, which are then of type
Valued Segments (VS). The example also shows how to add custom columns.


#
# GTrack example file 2
#
# Note: tech is a custom column and not part of the GTrack specification
#
##Track type: valued segments
###seqid  tech       start  end   value  strand
####genome=hg19
chr1      ChIP-seq   1047   1165  0.625  -
chr2      ChIP-chip  2002   2450  .      +
chr2      ChIP-chip  3033   3246  0.355  +


The third example is more advanced, showing a Step Function dataset, that is, a
dataset where every base pair in the domain has an associated value, but where
this value is constant, or approximated, over larger regions (250-500 bps). The
domain is, in this case, composed of two bounding regions. In addition, some of
the regions are linked by edges to other regions in the genome. This example
file is thus of type Linked Step Function (LSF).


#
# GTrack example file 3
#
##Track type: linked step function
##Edge weights: true
##Undirected edges: true
###id  end   value  edges

####seqid=chr1; start=1000; end=2250
1      1250  10     4=0.4
2      1500  7      .
3      2000  2      .
4      2250  6      1=0.4;6=0.3

####seqid=chr1; start=3000; end=4000
5      3250  7      .
6      3500  4      4=0.3
7      4000  6      .


(Note that, for readability, spaces are used instead of tab characters in
these example files. They will therefore not work "out of the box". All example
files are available as working GTrack files from [3].)


---------------------------
    Basic specification
---------------------------

GTrack is a tabular text file format. All GTrack filenames should end with
".gtrack". The GTrack format consists of 5 different line types, distinguished
by the leading characters and numbered here by order of appearance in the file:

  i. Comments
  1. Header lines
  2. Column specification line
  3a. Bounding region specification line
  3b. Data lines

Note: The number preceding each line type defines the order in which the lines
must be present, i.e. column specification must follow the header lines, but
comments may be present anywhere. Note that a bounding region specification line
must be followed by a data line, but that a file may have multiple bounding
region specifications with data lines in between.
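
As a rough illustration of how the leading characters distinguish the line
types, the following Python sketch classifies a single line. The function name
classify_line and the separate "blank" category are assumptions made for this
example; they are not part of the specification.

```python
def classify_line(line: str) -> str:
    """Return the GTrack line type implied by the leading characters.

    The longest prefix is tested first, since "####" also starts with "##".
    """
    if line.startswith("####"):
        return "bounding region specification"
    if line.startswith("###"):
        return "column specification"
    if line.startswith("##"):
        return "header"
    if line.startswith("#"):
        return "comment"
    if line.strip() == "":
        # Assumption: blank lines are reported separately from data lines.
        return "blank"
    return "data"


if __name__ == "__main__":
    for example in ("#This is a comment!", "##track type: segments",
                    "###seqid\tstart\tend", "####genome=hg19",
                    "chr1\t121\t201"):
        print(repr(example), "->", classify_line(example))
```

Testing the longest prefix first keeps the function free of any lookahead
logic, which is in line with the goal of simple parsing.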

A GTrack validator is available at [3].

  -----------
  i. Comments
  -----------
  
  - Leading characters: #
  - Example
  
      #This is a comment!
      
  - Usage: Optional
  
  Comments are ignored by parsers and may be present anywhere in the file. 
  
  
  ---------------
  1. Header lines
  ---------------
  
  - Leading characters: ##
  
  - Format
  
      ##VARIABLE:[ ]*VALUE
  
      where
        VARIABLE = Header variable name
        [ ]* = Optional space characters
        VALUE = Header variable value
  
  - Example
  
      ##gtrack version: 1.0
      ##track type: valued points
      ##value type: category
      ##1-indexed:  False
      ##end inclusive:True
      
  - Usage
  
      Optional, but any header variables not declared retain their default
      values.
  
  - Restrictions
  
      * GTrack files may add custom header variables, e.g. as part of the
        definition of a GTrack subtype (see section "Defining GTrack subtypes").
        For reserved header variables, however, the values are restricted to the
        ones allowed by the header variable (see below).
      
      * All variable names and reserved variable values are treated as case
        insensitive and do not support character escaping. Custom values, i.e.
        header values of non-reserved header variables, do, however, support
        escaping. For more details, see the section "Detailed specification of
        character usage".
      
  
  Header lines provide structural information readable by both humans and
  automatic parsers. The GTrack format defines a reserved set of header
  variables, each with a default value. If a header variable is not declared in
  the header lines, the default value is used. We encourage the use of header
  lines even when they contain default values as this adds to the clarity of the
  file and helps reduce parsing errors. The order of the header lines is
  unimportant.
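
  A minimal Python sketch of reading one header line according to the format
  above is shown here. The function name parse_header_line is hypothetical, and
  the sketch ignores character escaping in custom values (see the section
  "Detailed specification of character usage").

```python
from typing import Tuple


def parse_header_line(line: str) -> Tuple[str, str]:
    """Split a '##VARIABLE:[ ]*VALUE' line into (variable, value).

    The variable name is lower-cased because names are case insensitive.
    The value is returned as written, since only reserved values are
    case insensitive; the caller decides how to normalise it.
    """
    if not line.startswith("##") or line.startswith("###"):
        raise ValueError("not a header line: %r" % line)
    name, colon, value = line[2:].rstrip("\r\n").partition(":")
    if not colon:
        raise ValueError("header line lacks a colon: %r" % line)
    # Optional space characters may follow the colon.
    return name.lower(), value.lstrip(" ")


if __name__ == "__main__":
    print(parse_header_line("##1-indexed:  False"))
```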
  
  Developer notes
  ---------------
  As not all parsers/tools will have the need to support the full GTrack
  specification, developers are welcome to support only subsets. We do, however,
  encourage all GTrack parsers to always check the GTrack header lines and give
  feedback to the user if a particular feature is unsupported by the
  parser/tool. Note that non-reserved header lines should be ignored by parsers,
  unless they specifically support the particular extensions. We encourage
  parsers to print warning outputs for any unsupported, non-reserved header
  lines, as they may be a result of typing errors.
  
  Note also that, for consistency, the default values will not change in future
  versions of the GTrack specification.
  ---------------
  
  
  Reserved header variables
  -------------------------
  
  - GTrack version
  
      The version of the GTrack specification used for the file. 
      
      Default value: 1.0
  
  - Track type*
  
      one of:
        points
        valued points
        segments
        valued segments
        genome partition
        step function
        function
        linked points
        linked valued points
        linked segments
        linked valued segments
        linked genome partition
        linked step function
        linked function
        linked base pairs
          
      Defines the track type of a GTrack file. Each track type defines a set of
      core columns to be used. See the section "Column specification line" for
      more details.
      
      Default value: segments
  
  - Value type
  
      one of:
        number
        binary
        character
        category
      
      Only used if the "value" column is defined. Defines the kind of content
      accepted in the value column. See the section "Column specification line"
      for more details.
      
      Default value: number
  
  - Value dimension
  
      one of:
        scalar
        pair
        vector
        list
          
      Only used if the "value" column is defined. Defines the dimension of the
      content accepted in the value column. See the section "Column
      specification line" for more details.
      
      Default value: scalar
  
  - Undirected edges*
  
      Only used if the "edges" column is defined. True if all edges specified in
      the GTrack file are undirected, else false. Note that undirected edges
      between two track elements must still be specified in both data lines,
      using the same weights.
      
      Default value: false
      
  - Edge weights*
  
      Only used if the "edges" column is defined. True if weights are specified
      for edges, else false. If true, every edge must have a weight
      specification; if false, no edge may specify a weight.
      
      Default value: false
  
  - Edge weight type
  
      one of:
        number
        binary
        character
        category
      
      Only used if the "edges" column is defined and the "Edge weights" header
      variable is set to "true". Defines the kind of content accepted as edge
      weights. See the section "Column specification line" for more details.
      
      Default value: number
      
  - Edge weight dimension
  
      one of:
        scalar
        pair
        vector
        list
          
      Only used if the "edges" column is defined and the "Edge weights" header
      variable is set to "true". Defines the dimension of the content accepted
      as edge weights. See the section "Column specification line" for more
      details.
      
      Default value: scalar
  
  - Uninterrupted data lines*
  
      True if it is guaranteed that the data lines are not interrupted by
      bounding region specification lines (as happens when more than one
      bounding region is specified), comments or blank lines, else false. This
      is used to help simple parsers.
      
      Default value: false
  
  - Sorted elements*
  
      True if it is guaranteed that all bounding regions and track elements come
      in sorted order. Bounding regions must be sorted first, and the track
      elements in each bounding region block second. Regions are sorted by the
      following fields, in ascending order (using only the ones that are
      defined): genome, seqid, start, end.
  
      Default value: false
  
  - No overlapping elements*
  
      Only used for tracks of type Points and Segments, and the variations of
      these, i.e. Linked and/or Valued Points (VP/LP/LVP) and Linked and/or
      Valued Segments (VS/LS/LVS). True if it is guaranteed that no two track
      elements overlap, else false. 
  
      Default value: false
  
  - Circular elements*
  
      True if any track element or bounding region crosses the coordinate
      borders of a circular sequence, i.e. if the "end" value is smaller than
      the "start" value.
      
      Default value: false
      
  - 1-indexed
  
      True if the coordinates start at 1, false if the coordinates start at 0.
      
      Default value: false
  
  - End inclusive
  
      True if the chromosome coordinate specified in the end column is included
      in intervals, else false.
      
      Default value: false
  
      Developer notes
      ---------------
      We recommend that all parsers always check the values of the header
      variables '1-indexed' and 'end inclusive', even if only one or some
      settings are supported by the parser. If the values defined in a GTrack
      file are unsupported, the parser should fail. This greatly reduces the
      risk of erroneous positional information.
      ---------------
  
  (Note that the section "Extended specification" includes more reserved header
  variables.)
  
  *   Some header lines include redundant information compared to the rest of
      the file. These are marked with * in the listing above. The redundant
      header lines are still explicitly defined for several reasons. First, in
      order for a human reader to easily find out which features are used in a
      file. Second, as a way for simple parsers that only use a subset of the
      specification to check whether they can parse a particular file. Third, it
      enables automatic validation of whether a file contains the information in
      the way the author intended. These header lines can be automatically
      extracted from the rest of a GTrack file by the "Expand GTrack headers"
      tool, available at [3].
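
  The defaults listed above can be collected in a lookup table, so that
  undeclared header variables fall back to their default values. The following
  Python sketch is one possible way to do this; the names DEFAULT_HEADERS and
  resolve_headers are assumptions, and the table covers only the reserved
  header variables of the basic specification.

```python
from typing import Dict

# Default values of the reserved header variables listed in this section.
# Assumption: variables from the "Extended specification" are left out.
DEFAULT_HEADERS: Dict[str, str] = {
    "gtrack version": "1.0",
    "track type": "segments",
    "value type": "number",
    "value dimension": "scalar",
    "undirected edges": "false",
    "edge weights": "false",
    "edge weight type": "number",
    "edge weight dimension": "scalar",
    "uninterrupted data lines": "false",
    "sorted elements": "false",
    "no overlapping elements": "false",
    "circular elements": "false",
    "1-indexed": "false",
    "end inclusive": "false",
}


def resolve_headers(declared: Dict[str, str]) -> Dict[str, str]:
    """Merge declared header variables with the defaults.

    Names are lower-cased (they are case insensitive). Values of reserved
    variables are lower-cased as well; custom values are kept as written.
    """
    resolved = dict(DEFAULT_HEADERS)
    for name, value in declared.items():
        key = name.lower()
        resolved[key] = value.lower() if key in DEFAULT_HEADERS else value
    return resolved


if __name__ == "__main__":
    headers = resolve_headers({"Track type": "Valued Points"})
    print(headers["track type"], headers["value type"])
```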
  
  
  ----------------------------
  2. Column specification line
  ----------------------------
  
  - Leading characters: ###
  
  - Format
  
      ###COL1  COL2  COL3...
  
      where
        COL1, COL2, COL3 = Column names
        "  " = tab character
  
  - Example
  
      ###genome  seqid  start  end  strand  geneId  score  id  edges
      (with tabs instead of spaces)
  
  - Default value
      
      ###seqid  start  end
      (with tabs instead of spaces)
  
  - Usage
  
      Optional. If not defined, the default value is used.
  
  - Restrictions
  
      * Column names are treated as case insensitive and do not support
        character escaping. For more details, see the section "Detailed
        specification of character usage".
      
      * All column names must be unique.
  
  
  The column specification line is a tab-separated list of column names.
  
  The GTrack specification defines a set of eight reserved column names. Four of
  these are associated with the four core informational properties: gaps,
  lengths, values and interconnections. The specific set of core columns present
  defines the track type (see [1] for more details). The GTrack format also
  defines 4 reserved columns that, although they do not define track type, have
  reserved meanings. The associations between the reserved columns and track
  types are shown in the following table:


                    Column name:  genome seqid start end value strand id edges 
                 Type of column:     N     N     C    C    C      N    N   C   
 Track type:
  Points                   (P)       ?     !     X    .    .      ?    ?   .    
  Segments                 (S)       ?     !     X    X    .      ?    ?   .    
  Genome Partition         (GP)      ?     !     .    X    .      ?    ?   .   

  Valued Points            (VP)      ?     !     X    .    X      ?    ?   .   
  Valued Segments          (VS)      ?     !     X    X    X      ?    ?   .   
  Step Function            (SF)      ?     !     .    X    X      ?    ?   .   
  Function                 (F)       ?     !     .    .    X      ?    ?   .   

  Linked Points            (LP)      ?     !     X    .    .      ?    X   X   
  Linked Segments          (LS)      ?     !     X    X    .      ?    X   X   
  Linked Genome Partition  (LGP)     ?     !     .    X    .      ?    X   X   

  Linked Valued Points     (LVP)     ?     !     X    .    X      ?    X   X   
  Linked Valued Segments   (LVS)     ?     !     X    X    X      ?    X   X   
  Linked Step Function     (LSF)     ?     !     .    X    X      ?    X   X   
  Linked Function          (LF)      ?     !     .    .    X      ?    X   X   

  Linked Base Pairs        (LBP)     ?     !     .    .    .      ?    X   X   

  C - Core reserved column (defines track type)
  N - Non-core reserved column (reserved, but does not define track type)
  X - Column is mandatory
  ? - Column is optional
  . - Column is not allowed
  ! - Property must be present, either as a column or in a bounding region
      specification (see below)
    
  Table 1: Overview of the eight reserved columns in the GTrack format and their
           associations to track type.
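
  Since the presence or absence of the four core columns determines the track
  type, Table 1 can be expressed as a lookup keyed on those columns. The Python
  sketch below is an illustration only; TRACK_TYPES and infer_track_type are
  hypothetical names, and the check for the "id" column reflects the X marks in
  the "id" column of the table for linked track types.

```python
from typing import Dict, Iterable, Tuple

# Key: presence of the core columns (start, end, value, edges), per Table 1.
TRACK_TYPES: Dict[Tuple[bool, bool, bool, bool], str] = {
    (True, False, False, False): "points",
    (True, True, False, False): "segments",
    (False, True, False, False): "genome partition",
    (True, False, True, False): "valued points",
    (True, True, True, False): "valued segments",
    (False, True, True, False): "step function",
    (False, False, True, False): "function",
    (True, False, False, True): "linked points",
    (True, True, False, True): "linked segments",
    (False, True, False, True): "linked genome partition",
    (True, False, True, True): "linked valued points",
    (True, True, True, True): "linked valued segments",
    (False, True, True, True): "linked step function",
    (False, False, True, True): "linked function",
    (False, False, False, True): "linked base pairs",
}


def infer_track_type(column_names: Iterable[str]) -> str:
    """Derive the track type from the column names of a GTrack file."""
    columns = {name.lower() for name in column_names}  # case insensitive
    key = tuple(core in columns for core in ("start", "end", "value", "edges"))
    if key not in TRACK_TYPES:
        raise ValueError("no track type is defined without any core column")
    if "edges" in columns and "id" not in columns:
        raise ValueError('linked track types require the "id" column')
    return TRACK_TYPES[key]


if __name__ == "__main__":
    print(infer_track_type(["id", "end", "value", "edges"]))
```

  Comparing the inferred type with the declared "Track type" header variable is
  one way a parser could validate a file.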


  Reserved columns
  ----------------
  
  - genome
  
      The genome assembly of the track element (e.g. hg19, mm9). The GTrack
      format has no explicit requirements on the syntax or semantics of the
      genome specification; the interpretation is up to the particular
      parsers/tools. Elements from different genomes are allowed in the same
      GTrack file.
     
      Specifying the genome of a track element is optional. The genome may be
      specified either as a separate column in the data lines, or in a preceding
      bounding region specification line (see below), or both. If genome is
      specified both in a bounding region specification and as a column, the
      values must be equal.
  
  - seqid
      
      A sequence identifier, i.e. an identifier of the underlying sequence of
      the particular track element. Usually defined as chromosome (e.g. chr3,
      chrY, chr2_random) or scaffold (e.g. scaffold10671), as defined in the
      genome assembly. As with the "genome" column, the GTrack format has no
      explicit requirements on the syntax or semantics of the "seqid" column;
      the interpretation is up to the particular parsers/tools. Some parsers may
      for instance allow chromosome arms (e.g. chr1p) as seqid.
      
      All track elements in a GTrack file must have a seqid, either as a
      separate column in the data lines, or in a preceding bounding region
      specification line (see below), or both. If seqid is specified both in a
      bounding region specification and as a column, the values must be equal.
  
  - start
  
      The start position of the track element, using the indexing system defined
      in the header (0- or 1-based).
      
      Developer notes
      ---------------
      The start column is not defined for some track types (as described in
      Table 1). In order to still work on the start position of an element, it
      has to be inferred from other information in the following manner,
      according to track type:
      
      Genome Partition (GP), Step Function (SF), Linked Genome Partition (LGP)
      and Linked Step Function (LSF):
          
        The start position of each track element can be seen as the position
        immediately following the end of the track element of the previous line.
        The exact value of the start position depends on the "End inclusive"
        header variable: if the coordinates are end-exclusive, the start
        position of one track element is exactly the same as the end position
        of the previous line; if they are end-inclusive, the start position is
        the previous end position + 1. For the first line in a set of data
        lines, the start position should be set to the start position of the
        preceding bounding region (see section "Bounding region specification
        line").
          
      Function (F), Linked Function (LF) and Linked Base Pairs (LBP):
      
        Each line defines a successive location along the genome. The start of
        the first line in a set of data lines is then the start position of the
        preceding bounding region. The start value is then increased by 1 for
        each line.
      ---------------
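
      The inference rules in the developer notes above can be sketched in
      Python as follows. The function names are hypothetical, and the sketch
      assumes that the start of the bounding region and the "End inclusive"
      setting have already been read.

```python
from typing import List


def infer_partition_starts(region_start: int, ends: List[int],
                           end_inclusive: bool = False) -> List[int]:
    """Start positions for GP, SF, LGP and LSF tracks.

    The first element starts at the start of the bounding region; every
    later element starts right after the end of the previous element.
    """
    starts = []
    next_start = region_start
    for end in ends:
        starts.append(next_start)
        # End-exclusive: next start equals this end; inclusive: end + 1.
        next_start = end + 1 if end_inclusive else end
    return starts


def infer_function_starts(region_start: int, line_count: int) -> List[int]:
    """Start positions for F, LF and LBP tracks: one position per line."""
    return [region_start + offset for offset in range(line_count)]


if __name__ == "__main__":
    # First bounding region of GTrack example file 3.
    print(infer_partition_starts(1000, [1250, 1500, 2000, 2250]))
    print(infer_function_starts(100, 3))
```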
  
  - end
  
      The end position of the track element, using the indexing system (0- or
      1-based) and "end inclusive" property as defined in the header. 
  
      Developer notes
      ---------------
      The end column is not defined for some track types (Points (P), Valued
      Points (VP), Function (F), Linked Points (LP), Linked Valued Points (LVP),
      Linked Function (LF) and Linked Base Pairs (LBP), as described in Table
      1). In order to still work on the end position of an element, it has to be
      inferred from the start position. In these cases, the end position depends
      on the "End inclusive" header variable. If true, the end position is the
      same as the start position; if false, the end position is the start
      position + 1.
      ---------------
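
      As a short illustration of the rule above, the end position of a
      single-position element could be inferred as in this Python sketch (the
      function name infer_point_end is an assumption):

```python
def infer_point_end(start: int, end_inclusive: bool = False) -> int:
    """End position of an element that covers exactly one position."""
    # End-inclusive: the end is the position itself.
    # End-exclusive: the end is the first position after the element.
    return start if end_inclusive else start + 1


if __name__ == "__main__":
    print(infer_point_end(100), infer_point_end(100, end_inclusive=True))
```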
  
  - strand
  
      The strand of the track element. "+" for positive, "-" for negative
      strand, and "." when strand information is missing or irrelevant.
  
  - value
  
      The value or score of the track element. The character "." denotes that
      the track element has a missing value. The basic type of the contents
      follows the "Value type" header variable as follows:
  
        number
          
          One floating point number, e.g. -1.23, 12 or 3.1e-4. English decimal
          notation is used, including scientific e notation, with the period
          character representing the decimal separator, but with no spacing.
          Note that integer numbers are a subset of floating point numbers, and
          should use "number" as the value type.
              
        binary
          
          One binary value. If this value is used to denote case and control,
          the following notation must be used: 1 for case, 0 for control.
              
        character
          
          One ASCII character, e.g. A, T, C. See the section "Detailed
          specification of character usage" for restrictions.
          
        category
          
          A string defining a category. The set of all category values over all
          track elements form a category set, e.g: {gene, exon, promoter}. See
          the section "Detailed specification of character usage" for
          restrictions.
      
      In addition, the "value dimension" header variable may define that the
      value contains more than one instance of the basic value type, as follows:
  
        list
           
          A list of values, following the basic type defined in the "Value type"
          header variable. Lists of numbers and categories are delimited by
          comma, e.g. 1.23,2.34,3.45,4,5 or exon,gene,CDS,gene. Lists of binary
          values and characters use no delimiter, e.g. 1011011010 or ATGCTCGACG.
          Lists that combine different basic types are not allowed. The length
          of lists may vary between track elements. The missing element
          character, ".", is allowed in lists.
              
        vector
           
          A vector of values, similarly defined as a list, with the only
          difference that vectors must have the same length throughout the
          GTrack file.
              
        pair
          
          A pair of values, similarly defined as a vector, with the only
          limitation that the length is exactly 2.
              
        scalar
          
          A single value, following the basic type defined in the "Value type"
          header variable, e.g. 1.23, 0, g or exon, respectively.
          
      Developer notes
      ---------------
      Note that the different dimensions are defined in a hierarchical manner:
      lists > vectors > pairs & scalars. All scalars or pairs are also vectors
      of length 1 or 2, respectively, and all vectors are lists. Support for
      lists in a parser should then also lead to the support of its
      "sub-dimensions", given, of course, that the analysis allows that they are
      treated in an equal fashion.
      ---------------
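
      The combination of "Value type" and "Value dimension" can be sketched as
      a small Python parser. This is an illustration under several assumptions:
      the names parse_scalar and parse_value are hypothetical, binary values
      are taken to be written as 0 or 1 (as in the list example above),
      character escaping is ignored, and the requirement that vectors have the
      same length throughout the file is left to the caller. Note also that
      Python's float() is somewhat more lenient than the number notation
      described above.

```python
from typing import List, Optional, Union

Scalar = Union[float, bool, str, None]


def parse_scalar(text: str, value_type: str) -> Scalar:
    """Parse one basic value; '.' denotes a missing value (None)."""
    if text == ".":
        return None
    if value_type == "number":
        return float(text)
    if value_type == "binary":
        if text not in ("0", "1"):
            raise ValueError("binary values must be 0 or 1: %r" % text)
        return text == "1"
    if value_type == "character":
        if len(text) != 1:
            raise ValueError("expected a single character: %r" % text)
        return text
    if value_type == "category":
        return text
    raise ValueError("unknown value type: %r" % value_type)


def parse_value(text: str, value_type: str = "number",
                dimension: str = "scalar") -> Union[Scalar, List[Scalar]]:
    """Parse the contents of a value column cell."""
    if text == ".":
        return None  # the whole value is missing
    if dimension == "scalar":
        return parse_scalar(text, value_type)
    if dimension not in ("pair", "vector", "list"):
        raise ValueError("unknown value dimension: %r" % dimension)
    # Numbers and categories are comma-delimited; binary values and
    # characters use no delimiter, so each character is one item.
    if value_type in ("number", "category"):
        items = text.split(",")
    else:
        items = list(text)
    values = [parse_scalar(item, value_type) for item in items]
    if dimension == "pair" and len(values) != 2:
        raise ValueError("a pair must contain exactly 2 values: %r" % text)
    return values


if __name__ == "__main__":
    print(parse_value("1.23,2.34,3.45,4,5", "number", "list"))
    print(parse_value("ATGCTCGACG", "character", "list"))
```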
  
  - id
      
      A unique string identifying each track element (data line). Can be in any
      format, e.g. 1, aab or uc002ico.1. See the section "Detailed specification
      of character usage" for restrictions.
  
  - edges
  
      A semicolon-separated list of ids, representing edges from the track
      element in the current line to the track elements that the ids identify.
      A "." character denotes that the track element has no edges. An edge is by
      default directed.
      
      If the header variable "Edge weights" is set to "true", each edge must
      have a weight value directly following, after an equals sign. The format
      of the weight value follows the "Edge weight type" and the "Edge weight
      dimension" header variables in the same way as the "value" format follows
      the "Value type" and "Value dimension" header variables (see above). Note
      that no space characters are allowed after the semicolon.
  
      Example:
      
      ###seqid  start  end  id   edges
      chr1      0      100  aaa  aab=1.2;aac=.
      chr1      200    350  aab  aaa=1.1
      chr1      450    500  aac  .
  
      Here, the aaa node is connected to the aab node with two directed edges,
      with the edge from aaa to aab having higher weight than the one in the
      other direction. Note that undirected edges must still be specified in
      both directions, using the same weights. This adds redundancy, but
      simplifies parsing. If all edges in a GTrack file are undirected, the
      header variable "Undirected edges" should be set to "true".
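
      The edge syntax can be illustrated with the Python sketch below. It is a
      simplified reading of the rules above: the names parse_edges and
      is_symmetric are hypothetical, weights are assumed to be scalar numbers,
      and character escaping in ids is ignored.

```python
from typing import Dict, Optional

_ABSENT = object()  # marks an edge that does not exist at all


def parse_edges(text: str, edge_weights: bool = False) -> Dict[str, Optional[float]]:
    """Parse an edges cell into {target id: weight}.

    The weight is None when weights are not used or when it is missing (".").
    """
    if text == ".":
        return {}  # the track element has no edges
    edges: Dict[str, Optional[float]] = {}
    for item in text.split(";"):
        if edge_weights:
            target, equals, weight = item.partition("=")
            if not equals:
                raise ValueError("edge lacks a weight: %r" % item)
            edges[target] = None if weight == "." else float(weight)
        else:
            edges[item] = None
    return edges


def is_symmetric(graph: Dict[str, Dict[str, Optional[float]]]) -> bool:
    """True if every edge has a reverse edge with the same weight."""
    return all(
        graph.get(target, {}).get(source, _ABSENT) == weight
        for source, targets in graph.items()
        for target, weight in targets.items()
    )


if __name__ == "__main__":
    # The three data lines of the example above.
    graph = {
        "aaa": parse_edges("aab=1.2;aac=.", edge_weights=True),
        "aab": parse_edges("aaa=1.1", edge_weights=True),
        "aac": parse_edges(".", edge_weights=True),
    }
    print(graph["aaa"], is_symmetric(graph))
```

      A check such as is_symmetric could be used to verify a file that
      declares "Undirected edges" as true.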
  
  
  --------------------------------------
  3a. Bounding region specification line
  --------------------------------------
  
  - Leading characters: ####
  
  - Format
  
      Type A) ####genome=VAL1
      
        or
      
      Type B) ####[genome=VAL1;[ ]*]seqid=VAL2[;[ ]*start=VAL3][;[ ]*end=VAL4]
  
      where
        [x] = 'x' is optional
        [ ]* means optional space characters
        genome, seqid, start, end = reserved attribute names
        VAL1, VAL2, VAL3, VAL4 = attribute values
  
  - Example
  
      ####genome=hg18; seqid=chr1;start=100;  end=10000
  
  - Usage
  
      Type B is mandatory for GTrack files of one of the following track types:
        Genome Partition (GP)
        Step Function (SF)
        Function (F)
        Linked Genome Partition (LGP)
        Linked Step Function (LSF)
        Linked Function (LF)
        Linked Base Pairs (LBP)
          
      For all other track types, bounding region specification lines are
      optional.
  
  - Restrictions
  
      * Attribute names are treated as case insensitive and do not support
        character escaping. Genome and seqid values do, however, support
        escaping. For more details, see the section "Detailed specification of
        character usage".
      
      * A bounding region specification remains in effect for a set of data
        lines until the next bounding region specification.
      
      * If a GTrack file contains any bounding regions, then all elements must
        be enclosed by one.
  
      * Bounding regions are not allowed to overlap.
      
      * Bounding regions of type A and B are not allowed in the same GTrack
        file.
        
      * No data lines following a bounding region of type B may have start or
        end positions defined outside the bounding region.
  
      * For track types Genome Partition (GP), Step Function (SF), Linked Genome
        Partition (LGP) and Linked Step Function (LSF), the "end" attribute must
        be equal to the end position of the last track element of the block of
        data lines immediately following the bounding region specification line.
      
        Example:
      
        ##track type: genome partition
        ###end
        ####seqid=chr1; start=100; end=200
        125
        133
        200
      
      * For track types Function (F), Linked Function (LF) and Linked Base Pairs
        (LBP), the "end" attribute must be exactly equal to the "start"
        attribute plus the number of data lines immediately following the
        bounding region specification line. If the header line "End inclusive"
        is true, the end position should be 1 less.
      
        Example:
      
        ##track type: function
        ###value
        ####seqid=chr1; start=100; end=103
        1.2
        -0.1
        0.8
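
  The format and the last restriction above can be sketched in Python as
  follows. The function names parse_bounding_region and expected_function_end
  are assumptions made for this illustration, and the sketch does not handle
  escaped characters in genome and seqid values.

```python
from typing import Dict, Union

RESERVED_ATTRIBUTES = ("genome", "seqid", "start", "end")


def parse_bounding_region(line: str) -> Dict[str, Union[str, int]]:
    """Parse a type A or type B bounding region specification line."""
    if not line.startswith("####"):
        raise ValueError("not a bounding region specification line")
    attributes: Dict[str, Union[str, int]] = {}
    for pair in line[4:].rstrip("\r\n").split(";"):
        # Optional space characters may follow each semicolon.
        name, equals, value = pair.lstrip(" ").partition("=")
        name = name.lower()  # attribute names are case insensitive
        if not equals or name not in RESERVED_ATTRIBUTES:
            raise ValueError("invalid attribute: %r" % pair)
        if name in attributes:
            raise ValueError("attribute given twice: %r" % name)
        attributes[name] = int(value) if name in ("start", "end") else value
    # Type A holds only "genome"; type B requires "seqid".
    if "seqid" not in attributes and set(attributes) != {"genome"}:
        raise ValueError('"seqid" is mandatory for type B')
    return attributes


def expected_function_end(start: int, line_count: int,
                          end_inclusive: bool = False) -> int:
    """Required "end" attribute for F, LF and LBP bounding regions."""
    end = start + line_count
    return end - 1 if end_inclusive else end


if __name__ == "__main__":
    region = parse_bounding_region("####seqid=chr1; start=100; end=103")
    print(region, expected_function_end(region["start"], 3))
```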
  
  
  A bounding region specifies a genomic interval encompassing the data lines
  that follow. A bounding region should be thought of as constituting the domain
  of the following track elements, i.e. the region where we have information
  about the properties modeled by the track elements. The set of all bounding
  regions of a track then constitutes the domain of the track.
  
  Note that, in the case of Points and Segments (and the variations of these,
  i.e. Linked and/or Valued Points (VP/LP/LVP) and Linked and/or Valued Segments
  (VS/LS/LVS), see Table 1), lack of elements is also considered information. A
  bounding region is then, in this case, a region where we know that the lack of
  data means something. Areas of the genome that have not been investigated (such
  as centromeres) should be left outside the bounding regions. For track types
  other than Points and Segments (and their variations), the track elements do
  by definition fill the entire domain. For example, a Function has, by
  definition, a value for all base pairs in the domain. A bounding region is
  then just the smallest region encompassing the track elements that follow. For
  more details, see [1].
  
  The bounding region specification comes in two flavors:
  
  A)
      The bounding region specifies the genome assembly for the following track
      elements, using the same format as for the "genome" column (see the
      "Column specification line" section). The domain of the track is then the
      set of sequences constituting the genome, e.g. all chromosomes of the
      genome. If a track contains several genomes, the domain of the track is
      the collected set of sequences constituting all the specified genomes.
      
  B)
      The bounding region specifies a single sequence, or part of this sequence,
      as the domain of the following track elements. The format is a set of
      attribute pairs separated by semicolon and optional space characters. For
      each attribute pair, the attribute name and the value are separated by the
      equals sign. The attributes may appear in any order. The allowed
      attributes are the following:
      
      - genome
      
          The genome assembly of the bounding region (e.g. hg19, mm9). The
          format of the genome attribute is the same as for the "genome" column
          (see the section "Column specification line"). The "genome" attribute
          is optional.
          
      - seqid
      
          A sequence id, e.g. the id of the underlying sequence of the bounding
          region. The format of the seqid attribute is the same as for the
          "seqid" column (see the section "Column specification line"). The
          "seqid" attribute is mandatory for a bounding region specification
          line of type B.
          
          Note that if type B bounding region specifications are not defined,
          the "seqid" column must be included in the column specification line.
          
      - start
      
          The start position of the bounding region, using the indexing system
          defined in the header (0- or 1-based). The "start" attribute is
          optional.
          
          Developer notes
          ---------------
          If the "start" attribute is not specified, the start position of the
          bounding region is 0 (or 1, if the header variable "1-indexed" is
          true).
          ---------------
          
      - end
      
          The end position of the bounding region, using the indexing system (0-
          or 1-based) and "end inclusive" property as defined in the header. The
          "end" attribute is optional.
          
          Developer notes
          ---------------        
          If the "end" attribute is not specified, the end position of the
          bounding region is the same as the end position of the sequence
          referenced by the 'seqid' attribute, e.g. the length of the current
          chromosome. If the parser does not have information about the length
          of the sequence in question, the user should be informed, or, in the
          case that the bounding region is unimportant for the parser, the
          bounding region specification should be ignored.
          
          Note that the restrictions regarding the "end" attribute for certain
          track types (see section "Restrictions" above) must still hold, even
          if the "end" attribute is not explicitly specified.
          ---------------
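
  As an illustration of the attribute format of type B, the following Python
  sketch parses a bounding region specification line into its attributes. It is
  not part of the specification: the function name "parse_bounding_region" and
  the parameter "one_indexed" are hypothetical, URL unescaping of the values is
  left out, and an unspecified "end" is returned as None, since resolving it
  requires knowledge of the sequence length.

```python
ALLOWED_ATTRIBUTES = {"genome", "seqid", "start", "end"}


def parse_bounding_region(line, one_indexed=False):
    """Parse a type B bounding region specification line (sketch only)."""
    if not line.startswith("####"):
        raise ValueError("not a bounding region specification line")
    attrs = {}
    # Attribute pairs are separated by semicolon and optional spaces.
    for pair in line[4:].strip().split(";"):
        name, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed attribute pair: {pair!r}")
        # Attribute names are reserved phrases, hence case insensitive.
        attrs[name.strip().lower()] = value.strip()
    unknown = set(attrs) - ALLOWED_ATTRIBUTES
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    if "seqid" not in attrs:
        raise ValueError('the "seqid" attribute is mandatory for type B')
    default_start = 1 if one_indexed else 0
    return {
        "genome": attrs.get("genome"),
        "seqid": attrs["seqid"],
        "start": int(attrs["start"]) if "start" in attrs else default_start,
        # None means "end of the sequence"; the caller must resolve it.
        "end": int(attrs["end"]) if "end" in attrs else None,
    }
```

  For instance, the line "####seqid=chr1; start=201" yields the seqid "chr1"
  and the start position 201, with the genome and the end left unspecified.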
  
  
  --------------
  3b. Data lines
  --------------
  
  - Leading characters: 
  
  - Format
  
      VAL1  VAL2  VAL3...
  
      where
        VAL1, VAL2, VAL3 = column values
        "  " = tab character
  
  - Example
  
      chr21  304  997  -  FOOGENE  423  1  .
      (with tabs instead of spaces)
          
  - Usage
  
      Data lines are optional.
  
  - Restrictions
  
      * Column values support character escaping, as specified in the section
        "Detailed specification of character usage".
      
      * The number of columns of each data line must be equal to the number of
        columns in the column specification line.
      
      * For track types Genome Partition (GP), Step Function (SF), Linked Genome
        Partition (LGP), and Linked Step Function (LSF), the data lines in each
        bounding region block must be sorted on the "end" value, in ascending
        order.
  
  
  Each data line is a tab-separated list of values, as defined by the column
  specification line. If there is a missing value in either of the "value" and
  "edges" columns, the period character, ".", may be used. See the section
  "Column specification line" for more details.
  
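  The restrictions above can be illustrated by a minimal Python sketch that
  splits a data line on tabs, checks the number of columns, and maps the period
  character to a missing value in the "value" and "edges" columns. The function
  name "parse_data_line" is hypothetical, and unescaping and type conversion
  of the values are left out.

```python
def parse_data_line(line, columns):
    """Split one data line into a dict keyed by column name (sketch only)."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(
            f"expected {len(columns)} columns, found {len(values)}")
    record = {}
    for name, value in zip(columns, values):
        key = name.lower()  # column names are case insensitive
        # A single period marks a missing value in "value" and "edges".
        if key in ("value", "edges") and value == ".":
            value = None
        record[key] = value
    return record
```
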
  
  -----------------
  BED compatibility
  -----------------
  
  Note that a simple BED file only using the three columns chr, start and end is
  directly compatible with the GTrack format. This is because the default track
  type of a GTrack file is Segments (S), which defines the same three core
  columns as a simple BED file (see Table 1). One may thus only rename the file
  ending of such a file from ".bed" to ".gtrack" and run it through a GTrack
  parser. If a UCSC custom track definition line or other headers are present,
  they must be commented out. More complex BED files must be converted.
  Converters to common file formats are available at [3].
  
  
  -----------
  Compression
  -----------
  
  As genomic tracks may contain large amounts of data, we require that fully
  compliant GTrack parsers support the expansion of tabular files compressed
  with the gzip compression algorithm [4]. Such GTrack files should have the
  suffix ".gtrack.gz".
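
  In Python, for example, support for compressed files could be sketched as
  below, using the gzip module of the standard library. The function name
  "open_gtrack" is hypothetical, and selecting on the file suffix and decoding
  strictly as ASCII are assumptions of this sketch, not requirements of the
  specification.

```python
import gzip


def open_gtrack(path):
    """Open a GTrack file for reading as text, expanding gzip if needed."""
    if path.endswith(".gtrack.gz"):
        # Mode "rt" makes gzip.open return decoded text instead of bytes.
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")
```
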
  
  
  -----------------------------------------
  Detailed specification of character usage
  -----------------------------------------
  
  - The GTrack format supports escaping of special characters using URL escaping
    conventions (%XX hex codes). All ASCII characters are supported, except the
    following, which must be escaped everywhere:
      
      Most control characters (except TAB, LF, CR): %00-%08, %0B-%0C, 
                                                    %0E-%1F, %7F
      Extended ASCII characters: %80 through %FF
          
    Also, the following characters have reserved meaning, and must be escaped
    when used with other meanings in places where they may interfere with the
    parsing:
      
      tab (TAB): %09
      newline (LF): %0A
      carriage return (CR): %0D
      space: %20
      # (hash): %23
      % (percent): %25
      , (comma): %2C
      ; (semicolon): %3B
      = (equals): %3D
      . (period): %2E
    
    Note that spaces need not be escaped in data lines, as the data values are
    separated by tabs.
      
  - Reserved phrases in a GTrack file receive special treatment. Reserved
    phrases include all header variable names, reserved header variable values
    (excluding custom header variable values), column names (including custom
    columns) and bounding region attribute names. Reserved phrases should be
    treated as case insensitive and do not support URL escaping.
  
  - One must in all cases avoid starting or ending a value with unescaped
    whitespace.
    
  - A line must end with the newline character (LF), optionally preceded by a
    carriage return (CR).
  
  - Blank lines should be ignored by parsers.
  
  - Comments, header lines, column specification lines and bounding region
    specification lines are characterized by the leading number of #-characters.
    Note that, except for comments, once the file reaches a certain "level" of
    #-characters, this count never goes down. Thus, header lines, column
    specification and bounding region specifications are always found in that
    order.
    
  - Note that delimiter characters differ for the various lines/columns. See the
    specification above for details. Also note that examples in this file use
    spaces instead of tabs for readability. These examples should not be
    directly copied into GTrack files.
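
  The escaping rules above can be sketched in Python as follows. This is not
  a reference implementation: the function names are hypothetical, the escape
  function conservatively escapes every reserved character whether or not it
  would interfere with parsing, and treating the extended ASCII range %80-%FF
  as Latin-1 is an assumption, as the specification does not name an encoding.

```python
from urllib.parse import unquote

# Characters with reserved meaning: TAB, LF, CR, space, # % , ; = .
RESERVED = "\t\n\r #%,;=."


def escape_value(text):
    """Escape reserved, control and extended characters as %XX codes."""
    escaped = []
    for char in text:
        code = ord(char)
        if code > 0xFF:
            raise ValueError(f"cannot escape non-Latin-1 character {char!r}")
        if char in RESERVED or code < 0x20 or code >= 0x7F:
            escaped.append(f"%{code:02X}")
        else:
            escaped.append(char)
    return "".join(escaped)


def unescape_value(text):
    """Replace %XX codes by the characters they stand for."""
    return unquote(text, encoding="latin-1")
```

  With these functions, the value "a;b=c" is written as "a%3Bb%3Dc", and
  unescaping returns the original value.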

    
------------------------------
    Extended specification
------------------------------

  The extended part of the GTrack specification consists of the following header
  variables:
  
    - Value column
    - Edges column
    - Fixed length
    - Fixed gap size
    - Fixed-size data lines
    - Data line size
    - GTrack subtype
    - Subtype version
    - Subtype URL
    - Subtype adherence
    
  These header variables are redundant compared to the basic GTrack
  specification, that is, they do not allow any extra types of information to be
  represented. They do, however, allow existing information to be represented in
  more practical ways, in addition to supporting standardized ways of extending
  the GTrack format by defining GTrack subtypes.

  
  -----------------------
  Redefining column names
  -----------------------
  
  A GTrack file may contain several columns that could be used as the 'value'
  column, and similarly for the 'edges' column. To change which columns are
  used, one must, as described in the basic GTrack specification, modify the
  column specification line. The following header variables may, however,
  simplify the process.
  
  - Value column
  
      The name of the column to be used as the 'value' column.
      
      Default: value
      
  - Edges column
  
      The name of the column to be used as the 'edges' column.
      
      Default: edges
      
  Note that if either of these header variables has a non-default value, the
  corresponding default value ('value' or 'edges') must not be included in the
  column specification line. The following example is thus an incorrect GTrack
  file:
  
    ##track type: valued segments
    ##value column: score
    ###seqid  start  end  value  score
    chr1      0      50   1.0    0.9
    chr1      100    125  1.1    0.8
    
  The following file does, however, follow the GTrack specification:
  
    #
    # GTrack example file 4
    #
    ##track type: valued segments
    ##value column: score2
    ###seqid  start  end  score1  score2
    chr1      0      50   1.0     0.9
    chr1      100    125  1.1     0.8
  
  Developer notes
  ---------------
  The 'value column' and the 'edges column' header variables should be
  interpreted prior to parsing the column specification line. The column name
  referred to by the variable(s) should be renamed to 'value' or 'edges',
  respectively. If two columns in this way end up with the same name, the
  parser should return an error. In this way, a parser that does not support the
  'value column' and 'edges column' header variables will issue an error when a
  properly specified GTrack file with such headers is parsed, as, in that case,
  the track type will not match the column specification line according to
  Table 1. Parsing errors are recommended over incorrect analysis results
  caused by
  erroneous interpretation of columns.
  ---------------
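
  The renaming step described in the developer notes may be sketched in Python
  as below. The function name "apply_column_redefinition" and its parameters
  are hypothetical. Applied to the two example files above, the sketch rejects
  the first (two columns end up named 'value') and accepts the second.

```python
def apply_column_redefinition(columns, value_column="value",
                              edges_column="edges"):
    """Rename the columns selected by the header variables (sketch only)."""
    value_column = value_column.lower()
    edges_column = edges_column.lower()
    if value_column == edges_column:
        raise ValueError("'value column' and 'edges column' must differ")
    targets = {value_column: "value", edges_column: "edges"}
    # Column names are reserved phrases, hence compared case insensitively.
    renamed = [targets.get(name.lower(), name.lower()) for name in columns]
    if len(set(renamed)) != len(renamed):
        raise ValueError("two columns end up with the same name")
    return renamed
```
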

  
  -----------------
  WIG compatibility
  -----------------
  
  The WIG format [6] includes the parameters 'step' and 'span', specifying a
  fixed step size, i.e. the distance between start positions, and a fixed span
  size, i.e. the length of track elements, respectively. Consider for instance
  the following WIG file:
  
    fixedStep chrom=chr1 start=201 step=100 span=50
    25.0
    26.0
    fixedStep chrom=chr2 start=151 step=100 span=50
    10.0
    11.0
    
  A GTrack version of this file, using the basic specification, would look
  something like this, using three columns instead of one:
  
    #
    # GTrack example file 5A
    #
    ##Track type: valued segments
    ##1-indexed: true
    ##End inclusive: true
    ###start  end  value
    ####seqid=chr1
    201  250  25.0
    301  350  26.0
    ####seqid=chr2
    151  200  10.0
    251  300  11.0
  
  In order to support WIG-like functionality in GTrack, the following header
  variables may be used:
   
  - Fixed length
  
      Only used when the end column is not specified. Defines a fixed length for
      all elements in the GTrack file.
      
      Restrictions:
      
        * fixed length >= 1
      
      Track type dependency:
      
        When fixed length > 1, the track type should be determined as though the
        end column is present (see Table 1).
      
      Default: 1
      
      Developer notes
      ---------------
      Contrary to the restrictions of bounding regions of type B (see above),
      the end position of the segments in a bounding region is allowed to cross
      the region border, if implicitly defined by the "fixed length" header
      variable. Depending on the application, the parser must decide whether to
      crop the length of the elements, i.e. set the end position of any elements
      crossing the region border (typically the last element) equal to the end
      position of the surrounding bounding region.
      ---------------
      
  - Fixed gap size
  
      Only used when neither the start nor the end column is specified. Defines
      fixed-size gaps between all neighboring elements in the same bounding
      region. Gap size is defined as the number of uncovered base pairs between
      the elements. The following equation defines the relation between length,
      gap size and start positions:
      
        start_n+1 = start_n + fixed length + fixed gap size
        
      where
        'start_n+1' is the start position of a track element immediately
        following an element with start position 'start_n' in the same bounding
        region.
        
      Restrictions:
      
        * fixed length + fixed gap size > 0
        
        * Only allowed in GTrack files using bounding regions of type B (see
          section "Bounding region specification line").
      
      Track type dependency:
      
        When fixed gap size != 0, the track type should be determined as though
        the start column is present (see Table 1).
        
      Default: 0
      
  To convert from a WIG file to a GTrack file, one may use the following
  formulas:
  
    fixed length = span
      
    fixed gap size = step - span
      
  
  The WIG file shown above may then be represented in the following way as a
  GTrack file:
  
    #
    # GTrack example file 5B
    #
    ##Track type: valued segments
    ##1-indexed: true
    ##End inclusive: true
    ##Fixed length: 50
    ##Fixed gap size: 50
    ###value
    ####seqid=chr1; start=201
    25.0
    26.0
    ####seqid=chr2; start=151
    10.0
    11.0
    
  Note that the definitions above allow negative values for the variable 'fixed
  gap size'. Such values may be used to represent sliding windows, i.e. segments
  that overlap with a fixed number of base pairs.
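
  To illustrate how the implicit positions are derived, the following Python
  sketch expands the values of one bounding region into explicit start and end
  positions, using the equation given under "Fixed gap size" and the WIG
  conversion formulas. The function names and the "end_inclusive" parameter are
  hypothetical. Cropping at the region border is not handled.

```python
def wig_to_gtrack_parameters(step, span):
    """Return (fixed length, fixed gap size) for WIG 'step' and 'span'."""
    return span, step - span


def expand_fixed_elements(region_start, values, fixed_length=1,
                          fixed_gap_size=0, end_inclusive=False):
    """Return explicit (start, end, value) tuples for one bounding region."""
    if fixed_length < 1:
        raise ValueError("fixed length must be >= 1")
    if fixed_length + fixed_gap_size <= 0:
        raise ValueError("fixed length + fixed gap size must be > 0")
    elements = []
    start = region_start
    for value in values:
        # An inclusive end is the last covered position, else one past it.
        end = start + fixed_length - (1 if end_inclusive else 0)
        elements.append((start, end, value))
        start += fixed_length + fixed_gap_size
    return elements
```

  Applied to the first bounding region of example file 5B, the sketch
  reproduces the positions written out in example file 5A:

```python
length, gap = wig_to_gtrack_parameters(step=100, span=50)
print(expand_fixed_elements(201, ["25.0", "26.0"], length, gap,
                            end_inclusive=True))
# [(201, 250, '25.0'), (301, 350, '26.0')]
```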
  
  
  -------------------
  FASTA compatibility
  -------------------
  
  The following header variables may be used to represent FASTA-like sequences
  [5], and other simple function tracks, such as GC content, in a condensed
  manner. Consider a GTrack file of type 'Function', with only the value column
  specified:
  
    #
    # GTrack example file 6A
    #
    ##Track type: function
    ##Value type: character
    ###value
    ####seqid=seq001
    A
    G
    C
    ####seqid=seq002
    G
    G
  
  This is a valid GTrack file according to the basic specification. However,
  reading a sequence using only one nucleotide per line is quite impractical.
  The following header variables change the interpretation of the data lines:
  
  - Fixed-size data lines
  
      True if each data line has an exact size in terms of number of characters.
      This is only allowed for track type Function (F), and only if the only
      column specified is "value". Newline and carriage return characters are
      ignored when parsing, and the data lines are separated using the number of
      characters specified in the header variable "Data line size" (below).
      
      Developer notes
      ---------------
      Note that parsers still need to be able to recognize bounding region
      specification lines.
      ---------------
      
      Default: false
      
  - Data line size
  
      The size of each data line in terms of number of characters. It is only
      used if the header variable "Fixed-size data lines" (above) is true.
      
      Default: 1
      
  Using these header variables, the example GTrack file shown above can be
  expressed in the following way:
    
    #
    # GTrack example file 6B
    #
    ##Track type: function
    ##Value type: character
    ##Fixed-size data lines: true
    ##Data line size: 1
    ###value
    ####seqid=seq001
    AGC
    ####seqid=seq002
    GG
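
  As a sketch of how a parser might read such a block, the Python function
  below joins the physical lines of one bounding region, ignoring newline and
  carriage return characters, and cuts the result into values of "Data line
  size" characters. The function name is hypothetical, and rejecting a
  trailing incomplete value is an assumption of this sketch.

```python
def split_fixed_size_values(block_lines, data_line_size=1):
    """Cut the text of one bounding region into fixed-size values."""
    if data_line_size < 1:
        raise ValueError("data line size must be >= 1")
    # Newline and carriage return characters are ignored when parsing.
    text = "".join(
        line.replace("\r", "").replace("\n", "") for line in block_lines)
    if len(text) % data_line_size != 0:
        raise ValueError("block length is not a multiple of data line size")
    return [text[i:i + data_line_size]
            for i in range(0, len(text), data_line_size)]
```
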
  
  
  ------------------------
  Defining GTrack subtypes
  ------------------------
  
  The GTrack format includes support for defining GTrack subtypes, that is, file
  formats that adhere to only a subset of the GTrack specification. This allows
  implementation of more specialized parsers, while at the same time ensuring
  that subtype GTrack files still work with fully compliant GTrack parsers.
  GTrack subtypes may also be used to standardize special GTrack configurations,
  removing the need for the individual GTrack files to include all the required
  meta information. We encourage independent specification of subtypes catering
  to specialized needs.
  
  A GTrack subtype defines default values for header variables and/or the column
  specification line. A subtype may also add new header variables or define how
  parsers should interpret the values of any non-reserved columns. GTrack
  subtypes must still conform to the GTrack specification. Interpretation of new
  columns or header lines does of course require specialized parsers.
  
  
  Example #1: FASTA
  -----------------
  
  As an example of the use of subtypes, we show how GTrack can be used in a
  similar manner as conventional FASTA files [5] (see the section "FASTA
  compatibility" above). Example file 7A is the subtype specification file:
  
    
    #
    # GTrack example file 7A
    #
    # Specification of FASTA subtype for GTrack.
    # Available at http://gtrack.no/fasta.gtrack
    #
    ##GTrack version: 1.0
    ##GTrack subtype: FASTA
    ##Subtype version: 1.0
    ##Subtype adherence: strict
    ##Track type: function
    ##Value type: character
    ##Fixed-size data lines: true
    ##Data line size: 1
    ###value
  
  
  When using the subtype, an "online" parser will download the subtype
  specification file (above) and use the specified header values and/or column
  specification line instead of the GTrack default values. The header of a
  GTrack file adhering to the subtype may then be as simple as including the URL
  of the subtype specification, as in example file 7B:
  
  
    #
    # GTrack example file 7B
    #
    # This file makes use of the FASTA subtype specification.
    #
    ##Subtype URL: http://gtrack.no/fasta.gtrack
    ####seqid=seq0001
    TAGACATTACCGCTAGGATGATGCGATCGATCGATCCCTCTGGATTAGGAGATCTCTAGATCGATGATATCCTCNN
    NNNNNNNATTGCTCTAGCTCTAGCTCTAGCT
    ####seqid=seq0002
    GATTACATATCGCGATCGACTCGCCACTATAACTTCGAGTCTGACGATGATGGGGGGG
  
  
  GTrack subtype header lines
  ---------------------------
  
  Subtype functionality is applied with the following header variables:
  
  - GTrack subtype
  
      The name of the subtype of the GTrack format used for the file, if any.
      May be specified if a GTrack file conforms to a subtype, even if the
      header variable "Subtype URL" is not specified.
      
      Developer notes
      ---------------
      Custom parsers that only support certain subtypes should check this header
      and give feedback to users if the subtype is not correct.
      ---------------
      
      Default value: ""
  
  - Subtype version
  
      The version of the GTrack subtype. May be specified if a GTrack file
      conforms to a subtype, even if the header variable "Subtype URL" is not
      specified.
      
      Default value: 1.0
  
  - Subtype URL
  
      URL to a GTrack file used as a specification/model for the GTrack subtype,
      if any. The subtype GTrack specification file is a normal GTrack file, but
      without bounding region specification lines or data lines. The header
      lines and/or the column specification line of a GTrack subtype model file
      are used instead of the default values for other GTrack files that adhere
      to the subtype. Any other specifications/restrictions should be included
      as comments.
      
      The "subtype URL" header variable is not allowed in GTrack subtype
      specification files.
      
      Developer notes
      ---------------
      If a GTrack file contains a Subtype URL header line, the subtype
      specification file should be downloaded by the parser. Incomplete URLs
      without a specified scheme (e.g. "gtrack.no") should be treated as
      HTTP-addresses (e.g. "http://gtrack.no"). Any inconsistencies between
      header lines of the GTrack files and the subtype headers should be treated
      according to the "subtype adherence" header variable (see below). If the
      header variables "GTrack subtype" or "Subtype version" (see above) in a
      GTrack file do not correspond to the same header variables in the subtype
      specification file, the user should be informed. It is then up to the
      parser to decide whether or not to continue parsing.
      
      If subtype specification downloading is not supported by the parser and a
      subtype URL is provided in the GTrack file, the user should be informed
      that he/she may use the "Expand GTrack headers" tool available at [3] in
      order to merge the subtype headers with the GTrack file for use in
      "offline" parsers.
      ---------------
      
      Default value: ""
  
  - Subtype adherence
  
      Subtype adherence may be specified in the subtype GTrack specification
      file and will then regulate the way a GTrack file may override the subtype
      specification. The subtype adherence may also be specified in a GTrack
      file, and will in this case function as a signal to parsers. In this way,
      different parsers may allow different levels of adherence for GTrack files
      of the same subtype.
      
      The following values are allowed:
      
        strict
        
          Values of header variables and the column specification line, as
          defined by the subtype, may not be overridden by the contents of a
          file. GTrack defaults may be overridden.
          
          This option may be used to force users of a subtype to follow the
          specification exactly.
            
        extensible
        
          As strict, but allows redefinition of the column specification line in
          one aspect:
          
          * any number of extra columns, including non-core reserved columns,
            may be added to the end of the column specification line. Adding
            core reserved columns is not allowed.
          
          This option may be used to allow users of a subtype to add their own
          content, while maintaining the exact interpretation of the first
          columns as defined by the subtype.
            
        redefinable
            
          As extensible, but allows redefinition of the column specification
          line in another aspect:
          
          * the "value" and "edges" columns may be redefined, i.e. any non-core
            column names may be renamed to "value" or "edges", and vice-versa,
            or the "value" and/or "edges" column may be added to the end of the
            column specification line.
              
          * correspondingly, the header lines "Track type", "Value type", "Value
            dimension", "Undirected edges", "Edge weights", "Edge weight type",
            "Edge weight dimension", "Value column" and "Edges column" may also
            be redefined by the GTrack file.
              
          This option may be used to allow users of a subtype to add their own
          content, including redefining the "value" and "edges" columns, while
          maintaining exactly the same content in the first columns as defined
          by the subtype.
               
        reorderable
        
          As strict, but allows redefinition of the column specification line in
          the following manner:
          
          * all columns specified in the subtype specification must be included,
            but can be put in any order, and any extra columns may be added.
              
          * correspondingly, the header line "Track type" may also be redefined
            by the GTrack file.
              
          Note that in this case, redefinition of the "value" or "edges" columns
          is not allowed, as in "redefinable", but a "value" or an "edges"
          column may be added, if not present. This restriction guarantees
          consistent identification of columns by column name.
          
          This option may be used to allow users of a subtype to adopt their own
          column ordering, while at the same time maintaining that a minimum of
          columns must be present, identifiable by column name.
                 
        free
        
          Everything is allowed, as long as the GTrack specification is
          followed.
          
          This option leads to the subtype specification being used for no more
          than an alternative definition of default values of the GTrack header
          lines and column specification line.
   
      Developer notes
      ---------------
      Note that if subtype adherence is specified in the subtype specification
      as anything other than "free", a GTrack file using the subtype
      specification may not redefine this value.
      ---------------
  
      Default value: free
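
  As a partial illustration of the adherence levels, the following Python
  sketch checks the column specification line of a GTrack file against that of
  its subtype. It only covers the column-related rules of "strict",
  "extensible", "reorderable" and "free". The "redefinable" level and all
  header line rules are left out. The function name is hypothetical, and the
  set of core reserved columns used here (start, end, value, edges) is an
  assumption that should be checked against the section "Column specification
  line".

```python
# Assumption: the core reserved columns are those defining the track type.
CORE_COLUMNS = {"start", "end", "value", "edges"}


def columns_allowed(subtype_columns, file_columns, adherence):
    """Check a column specification against the subtype (sketch only)."""
    subtype = [name.lower() for name in subtype_columns]
    columns = [name.lower() for name in file_columns]
    adherence = adherence.lower()
    if adherence == "free":
        return True
    if adherence == "strict":
        return columns == subtype
    if adherence == "extensible":
        # Subtype columns must come first; extras must not be core columns.
        extras = columns[len(subtype):]
        return (columns[:len(subtype)] == subtype
                and not CORE_COLUMNS & set(extras))
    if adherence == "reorderable":
        # All subtype columns must be present, in any order.
        return set(subtype) <= set(columns)
    raise ValueError(f"adherence level not covered: {adherence!r}")
```
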
  
  
  Example #2: Short reads
  -----------------------
  
  As an extra example of the subtype functionality, we here propose a format for
  storing short reads (e.g. from ChIP-seq experiments). Again, example file 8A
  is the GTrack subtype specification file, and example file 8B is a GTrack file
  making use of the subtype:
  
  
    #
    # GTrack example file 8A
    #
    # Specification of Short reads example subtype.
    # Available at http://gtrack.no/shortreads_example.gtrack
    #
    ##GTrack version: 1.0
    ##GTrack subtype: Short reads example
    ##Subtype version: 0.9
    ##Subtype adherence: redefinable
    ##Track type: segments
    ###seqid  start  end  strand  read        quality
    #
    # Unmapped reads may be stored in comment lines at the end of the file, as
    # exemplified below.
    #
    # Unmapped reads:
    #
    # AGATAGATAGGATCCCAGCTGACT
    # AGTCCTCTAGCTCTGACTATC
  
  
  ---
  
  
    #
    # GTrack example file 8B
    #
    # GTrack file making use of the Short reads example subtype.
    #
    ##Track type: valued segments
    ##Subtype URL: http://gtrack.no/shortreads_example.gtrack
    ###seqid  start  end  strand  read        value    new
    chr1      101    111  +       AGTAGATAGC  0.8      0
    chr1      203    244  -       0:C;15:G    0.7      1
    #
    # Unmapped reads:
    #
    # ATGAATATTAAAAATCTCCT
    # AGCGACCATACGTACATTACGAC
  
  
  The "Short reads example" subtype defines two extra columns, named "read" and
  "quality". A read is then either the exact read (using nucleotide symbols with
  the exact same length as the track element) or a semicolon-separated list of
  colon-separated mismatches, where a mismatch is represented by a relative
  position and a nucleotide symbol. The reference is here the genome assembly
  specified in the description lines. The relative positions should follow the
  indexing defined by the "1-indexed" header variable. The "quality" column
  contains the quality score of the read. According to the "redefinable" subtype
  adherence setting, adding columns to the end is allowed. In example file 8B,
  the "new" column is added. Also note that the "redefinable" setting allows the
  redefinition of any non-core column as a "value" column, here the "quality"
  column.
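
  A specialized parser for this example subtype could interpret the "read"
  column along the lines of the following Python sketch. The function name
  "parse_read" and the shape of the return value are hypothetical. The sketch
  tells the two forms apart by the presence of a colon.

```python
def parse_read(read):
    """Interpret a value of the "read" column (sketch only).

    Returns ("sequence", str) for an exact read, or
    ("mismatches", [(relative_position, nucleotide), ...]) otherwise.
    """
    if ":" not in read:
        return ("sequence", read)
    mismatches = []
    # Mismatches are separated by semicolons, their two parts by a colon.
    for item in read.split(";"):
        position, nucleotide = item.split(":")
        mismatches.append((int(position), nucleotide))
    return ("mismatches", mismatches)
```

  For the two data lines of example file 8B, this gives an exact read for
  "AGTAGATAGC" and the mismatches (0, C) and (15, G) for "0:C;15:G".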
  
  A set of basic GTrack subtypes is available from [3].


------------------
    References
------------------

[1] Gundersen S, Kalas M, Abul O, Frigessi A, Hovig E, Sandve GK: Identifying
    elemental genomic track types and representing them uniformly. In press.
[2] http://genome.ucsc.edu/FAQ/FAQformat.html
[3] http://www.gtrack.no
[4] http://www.gzip.org
[5] http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
[6] http://genome.ucsc.edu/goldenPath/help/wiggle.html


------------------
    Change log
------------------

v1.0 - 2011.12.23:

    * First public version, included as "Additional file 1" in [1].