首页 \ 问答 \ 用Boost :: spirit编写的解析器的性能问题(Performance issue with parser written with Boost::spirit)

用Boost :: spirit编写的解析器的性能问题(Performance issue with parser written with Boost::spirit)

我想分析一个看起来像这样的文件(类似FASTA的文本格式):

    >InfoHeader
    "Some text sequence that has a line break after every 80 characters"
    >InfoHeader
    "Some text sequence that has a line break after every 80 characters"
    ...

例如:

    >gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
    MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
    IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE

我为boost :: spirit编写了一个解析器。 解析器正确地将标题行和以下文本序列存储在std::vector< std::pair< string, string >>但对于较大的文件需要很长的时间(对于100MB文件为17秒)。 作为比较,我写了一个没有boost :: spirit(只是STL函数)的程序,它简单地将这个100MB文件的每一行复制到一个std::vector 。 整个过程不到一秒钟。 用于比较的“程序”没有达到目的,但我认为解析器不应该花那么多时间......

我知道有很多其他的FASTA解析器,但我很好奇为什么我的代码很慢。

.hpp文件:

#include <boost/filesystem/path.hpp>

namespace fs = boost::filesystem;


class FastaReader {

public:
    typedef std::vector< std::pair<std::string, std::string> > fastaVector;

private:
    fastaVector fV;
    fs::path file;  

public:
    FastaReader(const fs::path & f);
    ~FastaReader();

    const fs::path & getFile() const;
    const fastaVector::const_iterator getBeginIterator() const;
    const fastaVector::const_iterator getEndIterator() const;   

private:
    void parse();

};

和.cpp文件:

#include <iomanip>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/filesystem/fstream.hpp>
#include <boost/filesystem/operations.hpp>
#include <boost/filesystem/path.hpp>
#include <boost/spirit/include/classic_position_iterator.hpp>
#include <boost/spirit/include/phoenix_bind.hpp>
#include <boost/spirit/include/phoenix_core.hpp>
#include <boost/spirit/include/phoenix_fusion.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/qi.hpp>
#include "fastaReader.hpp"


using namespace std;

namespace fs = boost::filesystem;
namespace qi = boost::spirit::qi;
namespace pt = boost::posix_time;

template <typename Iterator, typename Skipper>
struct FastaGrammar : qi::grammar<Iterator, FastaReader::fastaVector(), qi::locals<string>, Skipper> {
    qi::rule<Iterator> infoLineStart;
    qi::rule<Iterator> inputEnd;
    qi::rule<Iterator> lineEnd;
    qi::rule<Iterator, string(), Skipper> infoLine;
    qi::rule<Iterator, string(), Skipper> seqLine;
    qi::rule<Iterator, FastaReader::fastaVector(), qi::locals<string>, Skipper> fasta;


    FastaGrammar() : FastaGrammar::base_type(fasta, "fasta") {
        using boost::spirit::standard::char_;
        using boost::phoenix::bind;
        using qi::eoi;
        using qi::eol;
        using qi::lexeme;
        using qi::_1;
        using qi::_val;
        using namespace qi::labels;

        infoLineStart = char_('>');
        inputEnd = eoi;

        /* grammar */       
        infoLine = lexeme[*(char_ - eol)];
        seqLine = *(char_ - infoLineStart);

        fasta = *(infoLineStart > infoLine[_a = _1] 
            > seqLine[bind(&FastaGrammar::addValue, _val, _a, _1)]
            )
            > inputEnd
        ;

        infoLineStart.name(">");
        infoLine.name("sequence identifier");
        seqLine.name("sequence");

    }

    static void addValue(FastaReader::fastaVector & fa, const string & info, const string & seq) {
        fa.push_back(make_pair(info, seq));
    }
};


FastaReader::FastaReader(const fs::path & f) {
    this->file = f; 
    this->parse();
}


FastaReader::~FastaReader() {}


const fs::path & FastaReader::getFile() const {
    return this->file;
}


const FastaReader::fastaVector::const_iterator FastaReader::getBeginIterator() const {
    return this->fV.cbegin();
}


const FastaReader::fastaVector::const_iterator FastaReader::getEndIterator() const {
    return this->fV.cend();
}


void FastaReader::parse() {
    if ( this->file.empty() ) throw string("FastaReader: No file specified.");
    if ( ! fs::is_regular_file(this->file) ) throw (string("FastaReader: File not found: ") + this->file.string());

    typedef boost::spirit::istream_iterator iterator_type;
    typedef boost::spirit::classic::position_iterator2<iterator_type> pos_iterator_type;
    typedef FastaGrammar<pos_iterator_type, boost::spirit::ascii::space_type> fastaGr;

    fs::ifstream fin(this->file);
    if ( ! fin.is_open() ) {
        throw (string("FastaReader: Access denied: ") + this->file.string());
    }

    fin.unsetf(ios::skipws);

    iterator_type begin(fin);
    iterator_type end;

    pos_iterator_type pos_begin(begin, end, this->file.string());
    pos_iterator_type pos_end;

    fastaGr fG;
    try {
        std::cerr << "Measuring: Parsing." << std::endl;
        const pt::ptime startMeasurement = pt::microsec_clock::universal_time();

        qi::phrase_parse(pos_begin, pos_end, fG, boost::spirit::ascii::space, this->fV);

        const pt::ptime endMeasurement = pt::microsec_clock::universal_time();
        pt::time_duration duration (endMeasurement - startMeasurement);
        std::cerr << duration <<  std::endl;
    } catch (std::string str) {
        cerr << "error message: " << str << endl;
    }   
}

所以语法会做以下事情:它会查找“>”符号,然后存储所有后续字符,直到检测到EOL。 在EOL之后,当检测到“>”符号时,文本序列开始并结束。 然后通过调用FastaReader::addValue()两个字符串(标题行和文本序列)存储在std :: vector中。

我使用g ++版本4.8.2编译我的程序,使用-O2和-std = c ++ 11标志。

那么我的代码中的性能问题在哪里?


I want to parse a file that looks like this (FASTA-like text format):

    >InfoHeader
    "Some text sequence that has a line break after every 80 characters"
    >InfoHeader
    "Some text sequence that has a line break after every 80 characters"
    ...

e.g.:

    >gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
    MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
    IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE

I wrote a parser for this with boost::spirit. The parser correctly stores the header line and the following text sequence in a std::vector< std::pair< string, string >> but it takes kind of long for bigger files (17sec for a 100MB file). As comparison I wrote a program without boost::spirit (just STL functions) that simply copies each line of that 100MB file in a std::vector. The whole process takes less than a second. The "program" used for the comparison is not serving the purpose but I don't think the parser should take that much longer...

I know there are plenty of other FASTA parsers around but I'm rather curious why my code is slow.

The .hpp file:

#include <boost/filesystem/path.hpp>

namespace fs = boost::filesystem;


class FastaReader {

public:
    typedef std::vector< std::pair<std::string, std::string> > fastaVector;

private:
    fastaVector fV;
    fs::path file;  

public:
    FastaReader(const fs::path & f);
    ~FastaReader();

    const fs::path & getFile() const;
    const fastaVector::const_iterator getBeginIterator() const;
    const fastaVector::const_iterator getEndIterator() const;   

private:
    void parse();

};

And the .cpp file:

#include <iomanip>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/filesystem/fstream.hpp>
#include <boost/filesystem/operations.hpp>
#include <boost/filesystem/path.hpp>
#include <boost/spirit/include/classic_position_iterator.hpp>
#include <boost/spirit/include/phoenix_bind.hpp>
#include <boost/spirit/include/phoenix_core.hpp>
#include <boost/spirit/include/phoenix_fusion.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/qi.hpp>
#include "fastaReader.hpp"


using namespace std;

namespace fs = boost::filesystem;
namespace qi = boost::spirit::qi;
namespace pt = boost::posix_time;

template <typename Iterator, typename Skipper>
struct FastaGrammar : qi::grammar<Iterator, FastaReader::fastaVector(), qi::locals<string>, Skipper> {
    qi::rule<Iterator> infoLineStart;
    qi::rule<Iterator> inputEnd;
    qi::rule<Iterator> lineEnd;
    qi::rule<Iterator, string(), Skipper> infoLine;
    qi::rule<Iterator, string(), Skipper> seqLine;
    qi::rule<Iterator, FastaReader::fastaVector(), qi::locals<string>, Skipper> fasta;


    FastaGrammar() : FastaGrammar::base_type(fasta, "fasta") {
        using boost::spirit::standard::char_;
        using boost::phoenix::bind;
        using qi::eoi;
        using qi::eol;
        using qi::lexeme;
        using qi::_1;
        using qi::_val;
        using namespace qi::labels;

        infoLineStart = char_('>');
        inputEnd = eoi;

        /* grammar */       
        infoLine = lexeme[*(char_ - eol)];
        seqLine = *(char_ - infoLineStart);

        fasta = *(infoLineStart > infoLine[_a = _1] 
            > seqLine[bind(&FastaGrammar::addValue, _val, _a, _1)]
            )
            > inputEnd
        ;

        infoLineStart.name(">");
        infoLine.name("sequence identifier");
        seqLine.name("sequence");

    }

    static void addValue(FastaReader::fastaVector & fa, const string & info, const string & seq) {
        fa.push_back(make_pair(info, seq));
    }
};


FastaReader::FastaReader(const fs::path & f) {
    this->file = f; 
    this->parse();
}


FastaReader::~FastaReader() {}


const fs::path & FastaReader::getFile() const {
    return this->file;
}


const FastaReader::fastaVector::const_iterator FastaReader::getBeginIterator() const {
    return this->fV.cbegin();
}


const FastaReader::fastaVector::const_iterator FastaReader::getEndIterator() const {
    return this->fV.cend();
}


void FastaReader::parse() {
    if ( this->file.empty() ) throw string("FastaReader: No file specified.");
    if ( ! fs::is_regular_file(this->file) ) throw (string("FastaReader: File not found: ") + this->file.string());

    typedef boost::spirit::istream_iterator iterator_type;
    typedef boost::spirit::classic::position_iterator2<iterator_type> pos_iterator_type;
    typedef FastaGrammar<pos_iterator_type, boost::spirit::ascii::space_type> fastaGr;

    fs::ifstream fin(this->file);
    if ( ! fin.is_open() ) {
        throw (string("FastaReader: Access denied: ") + this->file.string());
    }

    fin.unsetf(ios::skipws);

    iterator_type begin(fin);
    iterator_type end;

    pos_iterator_type pos_begin(begin, end, this->file.string());
    pos_iterator_type pos_end;

    fastaGr fG;
    try {
        std::cerr << "Measuring: Parsing." << std::endl;
        const pt::ptime startMeasurement = pt::microsec_clock::universal_time();

        qi::phrase_parse(pos_begin, pos_end, fG, boost::spirit::ascii::space, this->fV);

        const pt::ptime endMeasurement = pt::microsec_clock::universal_time();
        pt::time_duration duration (endMeasurement - startMeasurement);
        std::cerr << duration <<  std::endl;
    } catch (std::string str) {
        cerr << "error message: " << str << endl;
    }   
}

So the grammar does the folloing: It looks for a ">" sign and then stores all following characters until an EOL is detected. After the EOL the text sequence starts and ends when a ">" sign is detected. Both strings (header line and text sequence) are then stored in a std::vector by calling FastaReader::addValue().

I compiled my program using g++ version 4.8.2 with -O2 and -std=c++11 flags.

So where is the the performance issue in my code?


原文:https://stackoverflow.com/questions/31341067
更新时间:2022-02-27 19:02

最满意答案

使Sorter::sortByNumber静态。 由于它不引用任何对象成员,因此不需要改变其他任何东西。

class Sorter {
public:                   
    static bool sortByNumber(const D& d1, const D& d2);
    ...
};

// Note out-of-class definition does not repeat static
bool Sorter::sortByNumber(const D& d1, const D& d2)
{
    ...
}

您还应该使用const引用,因为sortByNumber不应该修改对象。


Make Sorter::sortByNumber static. Since it doesn't reference any object members, you won't need to change anything else.

class Sorter {
public:                   
    static bool sortByNumber(const D& d1, const D& d2);
    ...
};

// Note out-of-class definition does not repeat static
bool Sorter::sortByNumber(const D& d1, const D& d2)
{
    ...
}

You should also use const references as sortByNumber should not be modifying the objects.

相关问答

更多
  • 是的,只要要排序的范围中的值类型具有定义“严格弱排序”的operator < ,也就是说,它可以用于正确比较两个MyTest实例。 你可以这样做: class MyTest { ... bool operator <(const MyTest &rhs) const { return m_first
  • 是的,你可以使用boost::bind : #include #include #include #include struct S { bool ascending; bool Compare(int lhs, int rhs) { return ascending ? (lhs < rhs) : (rhs < lhs); } }; int main () { int i[] ...
  • 根据std :: sort的这个参考,你的类需要是可移动的可移植的 ,并且可移动的以便可排序 。 如果你的班级本身不具有可移动性,那么类似这样的东西应该可以正常工作: class myclass { public: myclass(myclass const&) = delete; // or implement myclass(myclass&& other) { /* stuff to construct*/ } myclass& operator=(myclass const ...
  • 使Sorter::sortByNumber静态。 由于它不引用任何对象成员,因此不需要改变其他任何东西。 class Sorter { public: static bool sortByNumber(const D& d1, const D& d2); ... }; // Note out-of-class definition does not repeat static bool Sorter::sortByNumber(const D& d1, ...
  • 在第一种情况下, cmp被声明为class sample的成员函数,因此需要使用this指针来调用它。 由于this指针不可用,编译器正在抱怨它。 你可以通过将cmp声明为static函数来使其工作,因为静态函数不需要这个指针来调用。 在第二种情况下,由于cmp再次被声明为独立函数,它的行为与静态函数相同。 在第三种情况下(带有重载操作符),排序算法将负责调用向量中的每个对象的函数,并因此编译它。 In the first case cmp is declared as a member function ...
  • 问题是你的cmp方法需要是static 。 原因是非静态方法期望一个不可见的第一个参数,即this指针。 std::sort函数不传递这个额外的参数。 由于您引用成员变量,因此无法使该函数成为static函数,但还有其他方法可以解决此问题。 我建议使用新的C ++ 11标准功能, std::bind : std::sort(a, a+n, std::bind(&ConvexHull::cmp, this)); std::bind调用创建一个可调用对象,将第一个参数设置为this这样在调用时它是正确的。 T ...
  • 在示例1中,您声明了一个名为less_than_key的结构,但是您没有此结构的任何实例,您只有一个声明。 在LINE 1中,您调用struct的构造函数并创建std :: sort函数使用的对象的实例。 在示例2中,它是不同的。 你也声明了一个struct,这次没有名字(匿名),但区别在于你正在创建一个这种类型的实例“customLess”(隐式调用构造函数)。 因此,由于您已经创建了实例,因此无需在std :: sort函数中创建它。 看一下结构声明的区别: 1.- struct less_than_k ...
  • 由于u是std::pair的向量,因此您需要调用相应的比较函数。 由于仅仅名称是不够的,因此必须通过将其转换为具有正确函数指针类型的指针来为编译器区分它。 在你的情况下,它是一个函数,它对对类型采用两个const引用并返回一个bool - 所以你必须强制转换为的函数指针类型: bool (*)(const std::pair &, const std::pair &) 这是一个非常丑陋的演员: std::sort(u.beg ...

相关文章

更多

最新问答

更多
  • 您如何使用git diff文件,并将其应用于同一存储库的副本的本地分支?(How do you take a git diff file, and apply it to a local branch that is a copy of the same repository?)
  • 将长浮点值剪切为2个小数点并复制到字符数组(Cut Long Float Value to 2 decimal points and copy to Character Array)
  • OctoberCMS侧边栏不呈现(OctoberCMS Sidebar not rendering)
  • 页面加载后对象是否有资格进行垃圾回收?(Are objects eligible for garbage collection after the page loads?)
  • codeigniter中的语言不能按预期工作(language in codeigniter doesn' t work as expected)
  • 在计算机拍照在哪里进入
  • 使用cin.get()从c ++中的输入流中丢弃不需要的字符(Using cin.get() to discard unwanted characters from the input stream in c++)
  • No for循环将在for循环中运行。(No for loop will run inside for loop. Testing for primes)
  • 单页应用程序:页面重新加载(Single Page Application: page reload)
  • 在循环中选择具有相似模式的列名称(Selecting Column Name With Similar Pattern in a Loop)
  • System.StackOverflow错误(System.StackOverflow error)
  • KnockoutJS未在嵌套模板上应用beforeRemove和afterAdd(KnockoutJS not applying beforeRemove and afterAdd on nested templates)
  • 散列包括方法和/或嵌套属性(Hash include methods and/or nested attributes)
  • android - 如何避免使用Samsung RFS文件系统延迟/冻结?(android - how to avoid lag/freezes with Samsung RFS filesystem?)
  • TensorFlow:基于索引列表创建新张量(TensorFlow: Create a new tensor based on list of indices)
  • 企业安全培训的各项内容
  • 错误:RPC失败;(error: RPC failed; curl transfer closed with outstanding read data remaining)
  • C#类名中允许哪些字符?(What characters are allowed in C# class name?)
  • NumPy:将int64值存储在np.array中并使用dtype float64并将其转换回整数是否安全?(NumPy: Is it safe to store an int64 value in an np.array with dtype float64 and later convert it back to integer?)
  • 注销后如何隐藏导航portlet?(How to hide navigation portlet after logout?)
  • 将多个行和可变行移动到列(moving multiple and variable rows to columns)
  • 提交表单时忽略基础href,而不使用Javascript(ignore base href when submitting form, without using Javascript)
  • 对setOnInfoWindowClickListener的意图(Intent on setOnInfoWindowClickListener)
  • Angular $资源不会改变方法(Angular $resource doesn't change method)
  • 在Angular 5中不是一个函数(is not a function in Angular 5)
  • 如何配置Composite C1以将.m和桌面作为同一站点提供服务(How to configure Composite C1 to serve .m and desktop as the same site)
  • 不适用:悬停在悬停时:在元素之前[复制](Don't apply :hover when hovering on :before element [duplicate])
  • 常见的python rpc和cli接口(Common python rpc and cli interface)
  • Mysql DB单个字段匹配多个其他字段(Mysql DB single field matching to multiple other fields)
  • 产品页面上的Magento Up出售对齐问题(Magento Up sell alignment issue on the products page)