Nutch 1.4 with Solr 3.4 - can't crawl URL, “no URLs to fetch”

I followed a tutorial for web crawling with Nutch, using cygwin, tomcat, nutch 1.4 and solr 3.4. I was able to crawl a URL once, but now it no longer works, no matter which URL I try. My regex-urlfilter.txt in runtime/local/conf is as follows:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
 +^http://([a-z0-9]*\.)*nutch.apache.org/
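
One way to check whether the seed URL survives these rules is to replay them outside Nutch. The Python sketch below is only an approximation under stated assumptions: that the first matching rule decides, that a URL matching no rule is rejected, and that a line counts as a rule only when its first character is + or - (so a line starting with a space would be skipped like a comment). None of this is confirmed by the post, the suffix list is shortened, and the leading space before the last rule may simply be a copy/paste artifact.

```python
import re

# Rules as posted (suffix list shortened). Note the leading space on the last line.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|zip|ZIP|js|JS)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
 +^http://([a-z0-9]*\.)*nutch.apache.org/
"""


def parse_rules(text):
    """Return (accept, compiled_pattern) pairs for lines starting with + or -."""
    rules = []
    for line in text.splitlines():
        # Assumption: blank lines, '#' comments and lines that start with a
        # space are skipped; only '+' or '-' in column one makes a rule.
        if not line or line[0] not in "+-":
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


SEED = "http://nutch.apache.org/"
as_posted = parse_rules(RULES_TEXT)
without_space = parse_rules(RULES_TEXT.replace("\n +^", "\n+^"))

print(accepts(as_posted, SEED))      # False: the accept line was skipped
print(accepts(without_space, SEED))  # True
```

If the real filter behaves like this sketch, a stray space before the + would leave no accept rule at all, which would match the "0 records selected for fetching" message in the output.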

The only URL in my seed.txt in runtime/local/bin/urls is http://nutch.apache.org/.

To crawl, I use the command:

$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3

Console output is:

cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3

I know there are a few similar questions, but most of them are not resolved. Can anyone help?

Thank you very much in advance!


Source: https://stackoverflow.com/questions/44052230
Updated: 2021-12-12 16:12

Best answer

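Alternative approach:

You could create a series of images representing the progress updates, then use the setImage:forState: method at each step of the process to replace the UIButton's current image (its currentImage property). This does not require drawing into the existing view, and it has worked well for me for showing a simple "animation" of an image (on a button or elsewhere).

Would this approach work for you? If not, why not?

Bart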
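
The image-series idea comes down to mapping a progress fraction onto one of N pre-rendered frames. A minimal Python sketch of that mapping follows; the function name and the file names are hypothetical, and in the iOS case the chosen frame would be the image passed to setImage:forState:.

```python
def frame_index(progress, frame_count):
    """Map progress in [0.0, 1.0] to an index into frame_count pre-rendered images."""
    if frame_count < 1:
        raise ValueError("need at least one frame")
    # Clamp so out-of-range progress values still pick a valid frame.
    clamped = min(max(progress, 0.0), 1.0)
    # Floor to a bucket; progress == 1.0 maps to the last frame.
    return min(int(clamped * frame_count), frame_count - 1)


# Hypothetical file names for ten pre-rendered progress images.
frames = [f"progress_{i}.png" for i in range(10)]
print(frames[frame_index(0.45, len(frames))])
```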


This was really bugging me so after dealing with a series of silly, but necessary issues regarding the project I want this functionality for, I played around with it. The end result is that I can now arbitrarily draw an arc representing the progress of a particular background task to a button.

The goal was to draw something like the little indicator in the lower right-hand corner of the Xcode window while a project is being cleaned or compiled.

I created a function that will draw and fill an arc and return it as a UIImage.

The worker thread calls a method on the main thread (via performSelectorOnMainThread:withObject:waitUntilDone:) with the current values and a button identifier. In the called method, I call the arc image function with the percentage filled and the other drawing parameters.

example call:

oImg = [self ArcImageCreate:100.0f fWidth:100.0f 
    fPercentFilled: 0.45f fAngleStart: 0.0f xFillColor:[UIColor blueColor]];

Then set the background image of the button:

[oBtn setBackgroundImage: oImg forState: UIControlStateNormal];
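
The hand-off described above (a worker reports progress, and the main thread does the drawing) can be illustrated without UIKit. The Python sketch below uses a queue as a stand-in for performSelectorOnMainThread; the names worker and run are hypothetical, and the list drawn simply records the values that would be turned into arc images.

```python
import queue
import threading


def worker(updates, steps):
    """Background task: report fractional progress after each step."""
    for step in range(1, steps + 1):
        updates.put(step / steps)
    updates.put(None)  # sentinel: no more updates


def run(steps=4):
    """Consume updates on the calling ("main") thread and record what would be drawn."""
    updates = queue.Queue()
    thread = threading.Thread(target=worker, args=(updates, steps))
    thread.start()
    drawn = []
    while True:
        percent = updates.get()
        if percent is None:
            break
        # In the answer, this is where the arc image would be created and
        # set as the button's background image.
        drawn.append(percent)
    thread.join()
    return drawn


print(run())  # [0.25, 0.5, 0.75, 1.0]
```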

Here is the function:
It is not finished, but it works well enough to illustrate how I am doing this.



/**
ArcImageCreate
@ingroup    UngroupedFunctions
@brief  Create a filled or unfilled solid arc and return it as a UIImage.

        Allows for dynamic / arbitrary update of an object that allows a UIImage to be drawn on it. \
    This can be used for some sort of pie chart or progress indicator by Image Flipping.

@param  fHeight     The height of the created UIImage.
@param  fWidth      The width of the created UIImage.
@param  fPercentFilled  A percentage of the circle to be filled by the arc. 0.0 to 1.0.
@param  fAngleStart The angle where the arc should start. 0 to 360. Clock Reference.
@param  xFillColor  The color of the filled area.

@return Pointer to a UIImage.

@todo [] Validate object creation at each step.

@todo [] Convert PercentFilled (0.0 to 1.0) to appropriate radian(?) (-3.15 to +3.15)

@todo [] Background Image Support. Allow for the arc to be drawn on top of an image \
    and the whole thing returned.

@todo [] Background Image Reduction. Background images will have to be resized to fit the specified size. \
    Do not want to return a 65KB object because the background is 60K or whatever.

@todo [] UIColor RGBA Components. Determine a decent method of extracting RGBA values \
    from a UIColor*. Check out arstechnica.com/apple/guides/2009/02/iphone-development-accessing-uicolor-components.ars \
    for an idea.

*/
- (UIImage*) ArcImageCreate: (float)fHeight fWidth:(float)fWidth fPercentFilled:(float)fPercentFilled fAngleStart:(float)fAngleStart xFillColor:(UIColor*)xFillColor
{
    UIImage* fnRez = nil;
    float fArcBegin = 0.0f;
    float fArcEnd = 0.0f;
    float fArcPercent = 0.0f;
    UIColor* xArcColor = nil;
    float fArcImageWidth = 0.0f;
    float fArcImageHeight = 0.0f;
    CGRect xArcImageRect;

    CGContextRef xContext = nil;
    CGColorSpaceRef xColorSpace;
    void* xBitmapData;
    int iBMPByteCount;
    int iBMPBytesPerRow;
    float fPI = 3.14159;
    float fRadius = 25.0f;

// @todo Force default of 100x100 px if out of bounds. \
//  Check max image dimensions for iPhone. \
//  If negative, flip values *if* values are 'reasonable'. \
//  Determine minimum useable pixel dimensions. 10x10 px is too small. Or is it?
    fArcImageWidth = fWidth;
    fArcImageHeight = fHeight;

    // Get the passed target percentage and clip it between 0.0 and 1.0
    fArcPercent = (fPercentFilled < 0.0f) ? 0.0f : fPercentFilled;
    fArcPercent = (fArcPercent > 1.0f) ? 1.0f : fArcPercent;

    // Get the passed start angle; out-of-range values (below 0.0 or above 359.0) reset to 0.0
    fArcBegin = (fAngleStart < 0.0f) ? 0.0f : fAngleStart;
    fArcBegin = (fArcBegin > 359.0f) ? 0.0f : fArcBegin;

    // Convert to radians. The arc ends one sweep past its start angle.
    fArcBegin = (fArcBegin * fPI) / 180.0f;
    fArcEnd = fArcBegin + ((360.0f * fArcPercent) * fPI) / 180.0f;

    // 
    if (xFillColor == nil) {
        // random color
    } else {
        xArcColor = xFillColor;
    }

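    // Calculate the row stride and total size for a 32-bit (4 bytes per pixel) bitmap.
    iBMPBytesPerRow = ((int)fArcImageWidth * 4);
    iBMPByteCount = (iBMPBytesPerRow * (int)fArcImageHeight);
    // No malloc here: the context below is created with a NULL data pointer,
    // so Quartz allocates (and later frees) the backing store itself.
    xBitmapData = NULL;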

    // Create a color space. Behavior changes at OSXv10.4. Do not rely on it for consistency across devices.
    xColorSpace = CGColorSpaceCreateDeviceRGB();

    // Set the system to draw. Behavior changes at OSXv10.3.
    //  Both of these work. Not sure which is better.
//  xContext = CGBitmapContextCreate(xBitmapData, fArcImageWidth, fArcImageHeight, 8, iBMPBytesPerRow, xColorSpace, kCGImageAlphaPremultipliedFirst);
    xContext = CGBitmapContextCreate(NULL, fArcImageWidth, fArcImageHeight, 8, iBMPBytesPerRow, xColorSpace, kCGImageAlphaPremultipliedFirst);

    // Let the system know the colorspace reference is no longer required.
    CGColorSpaceRelease(xColorSpace);

    // Set the created context as the current context.
//  UIGraphicsPushContext(xContext);

    // Define the image's box.
    xArcImageRect = CGRectMake(0.0f, 0.0f, fArcImageWidth, fArcImageHeight);

    // Clear the image's box.
//  CGContextClearRect(xContext, xArcImageRect);

    // Draw the ArcImage's background image.
//  CGContextDrawImage(xContext, xArcImageRect, [oBackgroundImage CGImage]);

    // Set Us Up The Transparent Drawing Area.
    CGContextBeginTransparencyLayer(xContext, nil);

    // Set the fill and stroke colors.
    // CGContextSetFillColorWithColor takes a CGColorRef directly, so no
    // component extraction is needed. Fall back to the test color if no
    // fill color was supplied.
    if (xArcColor != nil) {
        CGContextSetFillColorWithColor(xContext, [xArcColor CGColor]);
    } else {
        CGContextSetRGBFillColor(xContext, 0.3f, 0.4f, 0.5f, 1.0f);
    }
    CGContextSetRGBStrokeColor(xContext, 0.5f, 0.6f, 0.7f, 1.0f);
    CGContextSetLineWidth(xContext, 1.0f);

// Something like this to reverse drawing?
//  CGContextTranslateCTM(xContext, TranslateXValue, TranslateYValue);
//  CGContextScaleCTM(xContext, -1.0f, 1.0f); or CGContextScaleCTM(xContext, 1.0f, -1.0f);

    // Test Vals
//  fArcBegin = 45.0f * fPI / 180.0f;   // 0.785397
//  fArcEnd = 90.0f * fPI / 180.0f; // 1.570795

    // Move to the start point and draw the arc.
    CGContextMoveToPoint(xContext, fArcImageWidth/2.0f, fArcImageHeight/2.0f);
    CGContextAddArc(xContext, fArcImageWidth/2.0f, fArcImageHeight/2.0f, fRadius, fArcBegin, fArcEnd, 0);


    // Ask the OS to close the arc (current point to starting point). 
    CGContextClosePath(xContext);

    // Fill 'er up. Implicit path closure.
    CGContextFillPath(xContext);
//  CGContextEOFillPath(context);

    // Close Transparency drawing area.
    CGContextEndTransparencyLayer(xContext);

    // Create an ImageReference and create a UIImage from it.
    CGImageRef xCGImageTemp = CGBitmapContextCreateImage(xContext);
    CGContextRelease(xContext);
    fnRez = [UIImage imageWithCGImage: xCGImageTemp];
    CGImageRelease(xCGImageTemp);

//  UIGraphicsPopContext;

    return fnRez;
}
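
The angle arithmetic in the function (and the todo about converting the filled percentage to radians) can be checked in isolation. The Python sketch below mirrors the clamping rules from the code and assumes the sweep is measured from the start angle, which is my reading of the intent rather than something the answer states; arc_angles is a hypothetical name.

```python
import math


def arc_angles(percent_filled, angle_start_deg=0.0):
    """Return (start, end) of the arc in radians for a fill fraction and start angle."""
    # Clip the fill fraction to [0.0, 1.0].
    percent = min(max(percent_filled, 0.0), 1.0)
    # Out-of-range start angles reset to 0, as in the Objective-C code.
    start_deg = angle_start_deg if 0.0 <= angle_start_deg <= 359.0 else 0.0
    start = math.radians(start_deg)
    # A full circle is 2*pi radians; the arc sweeps a fraction of it.
    end = start + percent * 2.0 * math.pi
    return start, end


start, end = arc_angles(0.45)
print(start, end)
```

These two values correspond to the start and end angles passed to CGContextAddArc.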

