gfakluge

A C++ library and utilities for manipulating the Graphical Fragment Assembly format.

What is it?

GFAKluge is a C++ parser/writer and a set of command line utilities for manipulating GFA files. It parses GFA to a set of data structures that represent the encoded graph. You can use these components and their fields/members to build up your own graph representation. You can also convert between GFA 0.1 <-> 1.0 <-> 2.0 to glue programs that use different GFA versions together.

Homepage: https://github.com/edawson/gfakluge
License: MIT

Dependencies

A C++11 compliant compiler (we recommend GCC or clang)
OpenMP (via GCC or clang)
NB: GFAKluge cannot be compiled with Apple clang, as it does not include OpenMP.

Command line utilities

When make is run, the gfak binary is built in the top level directory. It offers the following subcommands:

For CLI usage, run gfak or any of its subcommands with no arguments or with -h. To change the specification version, most commands take the -S flag followed by a single floating-point argument (e.g. 2.0).

Example CLI Usage

Examples of various commands are included in the examples.md file.

C++ API

Examples of the C++ API are included in the interface.md file.

How do I build it?

The gfak utilities are available via homebrew: brew install brewsci/bio/gfakluge

Building GFAKluge from source requires OpenMP. This should be supported on Linux by default. On Apple Mac OS X, we recommend installing gcc:

brew install gcc@8
make CXX=g++-8

or

sudo port install gcc8
make

You can then build libgfakluge and the command line gfak utilities by typing make in the repo.
To use GFAKluge in your program, you’ll need to add a few lines to your code. First, add the necessary include line to your C++ code:
#include "gfakluge.hpp"

Next, make sure the library is on your library search path and link against it on the compile line:

            g++ -o my_exe my_exe.cpp -L/path/to/gfakluge/ -lgfakluge

You should then be able to parse and manipulate gfa from your program:

                GFAKluge gg;
                gg.parse_gfa_file(my_gfa_file); 

                cout << gg << endl;

Why gfak / gfakluge?

Internal Structures

Internally, lines of GFA are represented as structs with member variables that correspond to their defined fields. Here’s the definition for a sequence line, for example:

            struct sequence_elem{
                std::string sequence;
                std::string name;
                std::map<std::string, std::string> opt_fields;
                long id;
            };

The structs for contained elements, link elements, and alignment elements are very similar. These individual structs are then wrapped in a set of standard containers for easy access:

            std::map<std::string, std::string> header;
            std::map<std::string, sequence_elem> name_to_seq;
            std::map<std::string, std::vector<contained_elem> > seq_to_contained;
            std::map<std::string, std::vector<link_elem> > seq_to_link;
            std::map<std::string, std::vector<alignment_elem> > seq_to_alignment;

All of these structures can be accessed using the get_<Thing> method, where <Thing> is the name of the map you would like to retrieve. They reside in gfakluge.hpp.
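
To make the correspondence between line fields and struct members concrete, here is a minimal, standalone sketch that fills a struct from a GFA1 segment (S) line. It does not use GFAKluge: segment_sketch, split_tabs, and parse_segment_line are hypothetical names invented for this illustration, and the library's own parser may work differently.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Local stand-in, loosely modeled on the sequence struct; not the library's type.
struct segment_sketch {
    std::string name;
    std::string sequence;
    std::map<std::string, std::string> opt_fields;
};

// Split one line on tab characters.
std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t')) {
        fields.push_back(field);
    }
    return fields;
}

// Fill `out` from a GFA1 segment line: S <name> <sequence> [TAG:TYPE:VALUE ...].
// Returns false, leaving `out` untouched, when the line is not an S line.
bool parse_segment_line(const std::string& line, segment_sketch& out) {
    const std::vector<std::string> f = split_tabs(line);
    if (f.size() < 3 || f[0] != "S") {
        return false;
    }
    segment_sketch parsed;
    parsed.name = f[1];
    parsed.sequence = f[2];
    for (std::size_t i = 3; i < f.size(); ++i) {
        // Optional fields are keyed on the tag before the first colon.
        const std::size_t colon = f[i].find(':');
        if (colon == std::string::npos) {
            continue;  // skip anything that is not TAG:TYPE:VALUE
        }
        parsed.opt_fields[f[i].substr(0, colon)] = f[i].substr(colon + 1);
    }
    out = parsed;
    return true;
}

// Usage: one S line with a single optional field.
void example_usage() {
    segment_sketch s;
    const bool ok = parse_segment_line("S\tseq1\tGATTACA\tLN:i:7", s);
    assert(ok);
    assert(s.name == "seq1" && s.sequence == "GATTACA");
    assert(s.opt_fields.at("LN") == "i:7");
}
```

Keying the optional fields on their tag keeps lookups simple, at the cost of leaving the type prefix inside the stored value.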

GFA2

GFAKluge now supports GFA2! This brings with it four new structs: edge_elem, gap_elem, fragment_elem, and group_elem. They’re contained in maps much like those for the GFA1 structs.

A few caveats apply:
1. As GFA2 is a superset of GFA1, we only support legal GFA2 -> GFA1 conversions. Information can be lost along the way (e.g. unordered groups won’t be output).
2. Our GFA2 testing is somewhat limited, but we have verified on several occasions that the output is on-spec.

Tags we specifically do not (i.e. cannot) support in GFA2 -> GFA1 conversion: G - gap, U - unordered group, F - fragment. Links and containments should get converted to edges correctly. Sequence elements should get converted, but watch out for the length field if you hit issues.
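
Since gap, unordered group, and fragment lines have no GFA1 counterpart, it can be useful to list them before converting. The following standalone sketch scans GFA2 text for those three record types. It is not part of GFAKluge, and both function names are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true when the line's record type is one of the three
// (G, U, F) that cannot be carried over to GFA1.
bool is_lost_in_gfa1_conversion(const std::string& line) {
    if (line.empty()) {
        return false;
    }
    // The record type is the first tab-delimited field and is one character long.
    if (line.size() > 1 && line[1] != '\t') {
        return false;
    }
    const char tag = line[0];
    return tag == 'G' || tag == 'U' || tag == 'F';
}

// Collect every line of a GFA2 document that a GFA2 -> GFA1 conversion would drop.
std::vector<std::string> lines_lost_in_gfa1_conversion(const std::string& gfa2_text) {
    std::vector<std::string> lost;
    std::istringstream in(gfa2_text);
    std::string line;
    while (std::getline(in, line)) {
        if (is_lost_in_gfa1_conversion(line)) {
            lost.push_back(line);
        }
    }
    return lost;
}

// Usage: a four-line GFA2 document with one gap and one unordered group.
void example_usage() {
    const std::string gfa2 =
        "H\tVN:Z:2.0\n"
        "S\ts1\t7\tGATTACA\n"
        "G\tg1\ts1+\ts2-\t10\t*\n"
        "U\tu1\ts1 s2\n";
    const std::vector<std::string> lost = lines_lost_in_gfa1_conversion(gfa2);
    assert(lost.size() == 2);
    assert(lost[0][0] == 'G');
    assert(lost[1][0] == 'U');
}
```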

As of September 2017, GFAKluge fully supports reading GFA2 and converting GFA0.1 <-> GFA1.0 -> GFA2.0.

Reading GFA

            GFAKluge gg;
            gg.parse_gfa_file("my_gfa.gfa");

You can then iterate over the aforementioned maps/structs and build out your own graph representation.
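
As one illustration of building your own representation, the sketch below turns a segment map and a link map into a directed adjacency list. It is self-contained and does not include gfakluge.hpp: link_sketch, LinkMap, Adjacency, and build_adjacency are stand-ins invented here, and the segment map is simplified to name -> sequence. With the library you would presumably feed in the maps returned by the get_<Thing> accessors instead.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Local stand-in for a link record; only the fields this sketch needs.
struct link_sketch {
    std::string source;
    std::string sink;
};

typedef std::map<std::string, std::vector<link_sketch> > LinkMap;
typedef std::map<std::string, std::set<std::string> > Adjacency;

// Build a directed adjacency list (orientation is ignored in this sketch).
// Every segment gets an entry, even if it has no outgoing links.
Adjacency build_adjacency(const std::map<std::string, std::string>& name_to_sequence,
                          const LinkMap& seq_to_link) {
    Adjacency adj;
    for (std::map<std::string, std::string>::const_iterator it = name_to_sequence.begin();
         it != name_to_sequence.end(); ++it) {
        adj[it->first];  // default-construct an empty neighbor set
    }
    for (LinkMap::const_iterator it = seq_to_link.begin(); it != seq_to_link.end(); ++it) {
        const std::vector<link_sketch>& links = it->second;
        for (std::size_t i = 0; i < links.size(); ++i) {
            adj[links[i].source].insert(links[i].sink);
        }
    }
    return adj;
}

// Usage: two segments joined by a single link seq1 -> seq2.
void example_usage() {
    std::map<std::string, std::string> segments;
    segments["seq1"] = "GATTACA";
    segments["seq2"] = "AATTGN";

    link_sketch l;
    l.source = "seq1";
    l.sink = "seq2";
    LinkMap links;
    links["seq1"].push_back(l);

    const Adjacency adj = build_adjacency(segments, links);
    assert(adj.size() == 2);
    assert(adj.at("seq1").count("seq2") == 1);
    assert(adj.at("seq2").empty());
}
```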

I’m working on a low-memory API for reading lines / emitting structs but it won’t be this pretty.

Writing GFA

            GFAKluge og;

            sequence_elem s;
            s.sequence = "GATTACA";
            s.name = "seq1";
            og.add_sequence(s);

            sequence_elem t;
            t.sequence = "AATTGN";
            t.name = "seq2";
            og.add_sequence(t);

            link_elem l;
            l.source = s.name;
            l.sink = t.name;
            l.source_orientation_forward = true;
            l.sink_orientation_forward = true;
            l.pos = 0;
            l.cigar = "";

            og.add_link(l.source, l);

            cout << og << endl;
            ofstream f("my_file.gfa");
            // Write GFA1
            f << og;

            // To convert to GFA2:
            og.set_version(2.0);
            f << og;
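
For orientation, the sketch below shows roughly what the text for a segment and a link looks like in GFA1. It is a standalone illustration based on the GFA1 line layout, not GFAKluge's own writer: segment_record, link_record, and to_gfa1_line are hypothetical names, and writing "*" for an empty overlap is an assumption taken from the GFA1 spec rather than from the library.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Local stand-ins with just the fields needed to print a line.
struct segment_record {
    std::string name;
    std::string sequence;
};

struct link_record {
    std::string source;
    std::string sink;
    bool source_orientation_forward;
    bool sink_orientation_forward;
    std::string cigar;
};

// GFA1 segment line: S <name> <sequence>
std::string to_gfa1_line(const segment_record& s) {
    return "S\t" + s.name + "\t" + s.sequence;
}

// GFA1 link line: L <from> <+|-> <to> <+|-> <overlap>
std::string to_gfa1_line(const link_record& l) {
    std::ostringstream out;
    out << "L\t" << l.source << '\t'
        << (l.source_orientation_forward ? '+' : '-') << '\t'
        << l.sink << '\t'
        << (l.sink_orientation_forward ? '+' : '-') << '\t'
        << (l.cigar.empty() ? std::string("*") : l.cigar);  // "*" = unspecified overlap
    return out.str();
}

// Usage: one segment and one forward-to-forward link without an overlap.
void example_usage() {
    segment_record s;
    s.name = "seq1";
    s.sequence = "GATTACA";
    assert(to_gfa1_line(s) == "S\tseq1\tGATTACA");

    link_record l;
    l.source = "seq1";
    l.sink = "seq2";
    l.source_orientation_forward = true;
    l.sink_orientation_forward = true;
    l.cigar = "";
    assert(to_gfa1_line(l) == "L\tseq1\t+\tseq2\t+\t*");
}
```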

Status

Getting Help

Eric T Dawson
github: edawson
Please post an issue for help.

Contributing

GFAKluge is open-source and community contributions are welcome and appreciated! Please keep the following in mind when contributing to the repo:

  1. Please treat others with kindness and professionalism. Everyone is welcome and we will not tolerate harassment for any reason.
  2. Please keep gfakluge.hpp header-only and update the build process if a modification alters it.
  3. Please update the dependency list if one is added.
  4. Please use semantic versioning. Minor changes bump the third versioning digit (e.g. 1.0.0 -> 1.0.1).
    Additional features, or changes that may partially break backward compatibility but do not require significant modifications to code depending on the library, bump the second versioning digit (e.g. 1.0.0 -> 1.1.0).
    Changes which significantly alter the API require a bump in the major version digit (e.g. 1.0.0 -> 2.0.0).
  5. Please fully specify all namespace items (e.g. std::string in place of just string).
  6. To incorporate changes, please file a pull request on the GitHub page.
  7. Bug reports or feature requests should be posted as “issues” on the GitHub page with the appropriate tag and referenced in any relevant pull requests.