Performance issue with parser written with Boost::spirit

I want to parse a file that looks like this (FASTA-like text format): 
    >InfoHeader
    "Some text sequence that has a line break after every 80 characters"
    >InfoHeader
    "Some text sequence that has a line break after every 80 characters"
    ...
 
For example:
    >gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
    MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
    IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE
 
I wrote a parser for this with boost::spirit. The parser correctly stores each header line and the text sequence that follows it in a std::vector< std::pair< string, string > >, but it is slow on larger files (17 seconds for a 100 MB file). For comparison, I wrote a program without boost::spirit (just standard library functions) that simply copies each line of that 100 MB file into a std::vector. That whole process takes less than a second. The comparison program does not do the same job as the parser, but I don't think the parser should take that much longer.
I know there are plenty of other FASTA parsers around but I'm rather curious why my code is slow. 
The .hpp file: 
#include <string>
#include <utility>
#include <vector>
#include <boost/filesystem/path.hpp>

namespace fs = boost::filesystem;


class FastaReader {

public:
    typedef std::vector< std::pair<std::string, std::string> > fastaVector;

private:
    fastaVector fV;
    fs::path file;  

public:
    FastaReader(const fs::path & f);
    ~FastaReader();

    const fs::path & getFile() const;
    const fastaVector::const_iterator getBeginIterator() const;
    const fastaVector::const_iterator getEndIterator() const;   

private:
    void parse();

};
 
And the .cpp file: 
#include <iomanip>
#include <iostream>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/filesystem/fstream.hpp>
#include <boost/filesystem/operations.hpp>
#include <boost/filesystem/path.hpp>
#include <boost/spirit/include/classic_position_iterator.hpp>
#include <boost/spirit/include/phoenix_bind.hpp>
#include <boost/spirit/include/phoenix_core.hpp>
#include <boost/spirit/include/phoenix_fusion.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/qi.hpp>
#include "fastaReader.hpp"


using namespace std;

namespace fs = boost::filesystem;
namespace qi = boost::spirit::qi;
namespace pt = boost::posix_time;

template <typename Iterator, typename Skipper>
struct FastaGrammar : qi::grammar<Iterator, FastaReader::fastaVector(), qi::locals<string>, Skipper> {
    qi::rule<Iterator> infoLineStart;
    qi::rule<Iterator> inputEnd;
    qi::rule<Iterator> lineEnd;
    qi::rule<Iterator, string(), Skipper> infoLine;
    qi::rule<Iterator, string(), Skipper> seqLine;
    qi::rule<Iterator, FastaReader::fastaVector(), qi::locals<string>, Skipper> fasta;


    FastaGrammar() : FastaGrammar::base_type(fasta, "fasta") {
        using boost::spirit::standard::char_;
        using boost::phoenix::bind;
        using qi::eoi;
        using qi::eol;
        using qi::lexeme;
        using qi::_1;
        using qi::_val;
        using namespace qi::labels;

        infoLineStart = char_('>');
        inputEnd = eoi;

        /* grammar */       
        infoLine = lexeme[*(char_ - eol)];
        seqLine = *(char_ - infoLineStart);

        fasta = *(infoLineStart > infoLine[_a = _1] 
            > seqLine[bind(&FastaGrammar::addValue, _val, _a, _1)]
            )
            > inputEnd
        ;

        infoLineStart.name(">");
        infoLine.name("sequence identifier");
        seqLine.name("sequence");

    }

    static void addValue(FastaReader::fastaVector & fa, const string & info, const string & seq) {
        fa.push_back(make_pair(info, seq));
    }
};


FastaReader::FastaReader(const fs::path & f) {
    this->file = f; 
    this->parse();
}


FastaReader::~FastaReader() {}


const fs::path & FastaReader::getFile() const {
    return this->file;
}


const FastaReader::fastaVector::const_iterator FastaReader::getBeginIterator() const {
    return this->fV.cbegin();
}


const FastaReader::fastaVector::const_iterator FastaReader::getEndIterator() const {
    return this->fV.cend();
}


void FastaReader::parse() {
    if ( this->file.empty() ) throw string("FastaReader: No file specified.");
    if ( ! fs::is_regular_file(this->file) ) throw (string("FastaReader: File not found: ") + this->file.string());

    typedef boost::spirit::istream_iterator iterator_type;
    typedef boost::spirit::classic::position_iterator2<iterator_type> pos_iterator_type;
    typedef FastaGrammar<pos_iterator_type, boost::spirit::ascii::space_type> fastaGr;

    fs::ifstream fin(this->file);
    if ( ! fin.is_open() ) {
        throw (string("FastaReader: Access denied: ") + this->file.string());
    }

    fin.unsetf(ios::skipws);

    iterator_type begin(fin);
    iterator_type end;

    pos_iterator_type pos_begin(begin, end, this->file.string());
    pos_iterator_type pos_end;

    fastaGr fG;
    try {
        std::cerr << "Measuring: Parsing." << std::endl;
        const pt::ptime startMeasurement = pt::microsec_clock::universal_time();

        qi::phrase_parse(pos_begin, pos_end, fG, boost::spirit::ascii::space, this->fV);

        const pt::ptime endMeasurement = pt::microsec_clock::universal_time();
        pt::time_duration duration (endMeasurement - startMeasurement);
        std::cerr << duration <<  std::endl;
    } catch (const std::string & str) {
        cerr << "error message: " << str << endl;
    }   
}
 
So the grammar does the following: it looks for a ">" sign and then stores all following characters until an EOL is detected. After the EOL the text sequence starts, and it ends when the next ">" sign is detected. Both strings (header line and text sequence) are then stored in the std::vector by calling FastaGrammar::addValue().
I compiled my program using g++ version 4.8.2 with the -O2 and -std=c++11 flags. 
So where is the performance issue in my code? 

Source: https://stackoverflow.com/questions/31341067

Updated: 2022-02-27 19:02

Best answer

Make Sorter::sortByNumber static. Since it doesn't reference any object members, you won't need to change anything else. 
class Sorter {
public:                   
    static bool sortByNumber(const D& d1, const D& d2);
    ...
};

// Note out-of-class definition does not repeat static
bool Sorter::sortByNumber(const D& d1, const D& d2)
{
    ...
}
 
You should also take the arguments by const reference, since sortByNumber should not modify the objects.
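As a minimal sketch of how the static comparator would then be used: the struct `D` with a `number` field and the `Sorter::sort` method below are assumptions for illustration, since the original definitions are not shown.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical element type; the original D is not shown.
struct D {
    int number;
};

class Sorter {
public:
    // Static: needs no Sorter instance, so it works as a plain comparator.
    static bool sortByNumber(const D& d1, const D& d2);

    // Hypothetical method showing the call to std::sort.
    void sort(std::vector<D>& items) const;
};

// Note out-of-class definition does not repeat static
bool Sorter::sortByNumber(const D& d1, const D& d2)
{
    return d1.number < d2.number;
}

void Sorter::sort(std::vector<D>& items) const
{
    // A static member function converts to an ordinary function pointer.
    std::sort(items.begin(), items.end(), &Sorter::sortByNumber);
}
```

A non-static member function cannot be passed this way, because calling it requires an object in addition to the two arguments that std::sort supplies.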
