How Digg is Built
2019-03-27 01:02 | Source: the Web
At Digg we have substantially rebuilt our infrastructure over the last year in what we call "Digg V4". This blog post gives a high-level view of the systems and technologies involved and how we use them. Read on to find out the secrets of the Digg engineers!
Let us start by reviewing the public products that Digg provides to users:
- a social news site for all users,
- a personalized social news site for an individual,
- an Ads platform,
- an API service and
- blog and documentation sites.
These sites are primarily accessed by people visiting in browsers or applications. Some people have Digg user accounts and are logged in; they get the personalized site My News. Everyone gets the all user site which we call Top News. These products are all seen on 'digg.com' and the mobile site 'm.digg.com'. The API service is at 'services.digg.com'. Finally, there are the 'about.digg.com' (this one) and 'developers.digg.com' sites which together provide the company blog and documentation for users, publishers, advertisers and developers.
This post will mainly cover the high-level technology of the social news products.
What we are trying to do
We are trying to build social news sites based on user-submitted stories and advertiser-submitted ad content.
Story Submission: The stories are submitted by logged-in users with some descriptive fields: a title, a paragraph, a media type, a topic and a possible thumbnail. These fields are extracted from the source document using a variety of metadata standards (such as the Facebook Open Graph protocol and oEmbed, plus some filtering) but the submitter has the final edit on all of these. Ads are submitted by publishers to a separate system but, if Dugg enough, may become stories.
Story Lists: All stories are shown in multiple "story lists" such as Most Recent (by date, newest first), by topic of the story, by media type and, if you follow the submitting user, in the personalized social news product, My News.
Story Actions: Users can act on the stories and story lists: read them, click on them, Digg them, Bury them, make comments, vote on the comments and more. A user who is not logged in can only read or click stories.
Story Promotion: Several times per hour we determine stories to move from recent story lists to the Top News story list. The algorithm (our secret sauce!) picks the stories by looking at both user actions and content classification features.
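As an illustration of the Story Submission step, here is a minimal sketch of pulling suggested fields from Open Graph `<meta>` tags and letting the submitter's edits win. It uses only the Python standard library; the function name `suggest_story_fields`, the field names and the sample page are assumptions made for this example, not Digg's actual code, and oEmbed and the filtering step are left out.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.fields.setdefault(prop[3:], content)


def suggest_story_fields(html, overrides=None):
    """Return suggested submission fields; the submitter's edits win."""
    parser = OpenGraphParser()
    parser.feed(html)
    suggested = {
        "title": parser.fields.get("title", ""),
        "description": parser.fields.get("description", ""),
        "media_type": parser.fields.get("type", ""),
        "thumbnail": parser.fields.get("image"),
    }
    suggested.update(overrides or {})
    return suggested


# Hypothetical source document and a submitter edit to the title.
page = """
<html><head>
<meta property="og:title" content="How Digg is Built">
<meta property="og:type" content="article">
<meta property="og:image" content="http://example.com/thumb.png">
</head><body></body></html>
"""
fields = suggest_story_fields(page, overrides={"title": "How Digg is built"})
print(fields)
```

Applying the overrides last is what gives the submitter the final edit: extracted metadata is only a default.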
How do we do it?
Let us take a look at a high level view of how somebody visiting one of the Digg sites gets served with content and can do actions. The following picture shows the public view and the boundary to the internal services that Digg uses to provide the Pages, Images or API requests.
The edge of our internal systems is simplified here but does show that the API Servers proxy requests to our internal back end services servers. The front end servers are virtually stateless (apart from some caching) and rely on the same service layer. The CMS and Ads systems will not be described further in this post.
Taking a look at the internal high level services in an abstract fashion, these can be generally divided into two system parts:
- Online or Interactive or Synchronous
  - Serve user requests for a page or API call, directly or indirectly. Each service has to respond within some number of milliseconds, and in aggregate the user should not wait more than 1 or 2 seconds for a good page response. This includes AJAX requests, which are asynchronous for the user in the browser but are request/response from the serving system's point of view.
- Offline or Batch or Asynchronous
  - Serve requests that are not in the interactive request-response loop and are typically only indirectly initiated by a user. The work here can take seconds, minutes or (rarely) hours.
The two parts above are used in Digg as shown in this diagram:
Looking deeper into the components.
Online Systems
The applications serving pages or API requests are mainly written in PHP (Front End, Drupal CMS) and Python (API server) using Tornado. They call the back end services via the Thrift protocol to a set of services written in Python. Many things are cached in the online applications (FE and BE) using Memcached and Redis; some items are primarily stored in Redis too, described below.
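The caching mentioned above commonly follows a cache-aside pattern: look in the cache first, call the back end service on a miss, then store the result with an expiry. The sketch below assumes that pattern; `TTLCache` is a small in-process stand-in for a Memcached client, and `fetch_story_from_backend` is a hypothetical placeholder for a Thrift call. Neither name comes from Digg's code.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Memcached client (get/set with expiry)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


backend_calls = []  # records which story IDs reached the back end


def fetch_story_from_backend(story_id):
    """Hypothetical back end call; a real one would go over Thrift."""
    backend_calls.append(story_id)
    return {"id": story_id, "title": "Story %d" % story_id}


def get_story(cache, story_id, ttl=60):
    """Cache-aside read: try the cache, fall back to the service, then fill."""
    key = "story:%d" % story_id
    story = cache.get(key)
    if story is None:
        story = fetch_story_from_backend(story_id)
        cache.set(key, story, ttl)
    return story


cache = TTLCache()
first = get_story(cache, 42)
second = get_story(cache, 42)  # served from the cache, no back end call
print(first == second, len(backend_calls))
```

The expiry bounds how stale a cached story can get, which suits mostly stateless front ends that tolerate slightly old data.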
Messaging and Events
The online and offline worlds are connected in two ways: synchronously, by calls to the primary data stores and the transient / logging systems, and asynchronously, using RabbitMQ to queue up events that have happened, like "a user Dugg a story", or jobs to perform, such as "please compute this thing".
Batch and Asynchronous Systems
When a message is found in a queue, a job worker is called to perform the specific action. Some messages are also triggered by a time-based, cron-like mechanism. The workers typically work on some of the data in the primary stores or offline stores (e.g. logs in HDFS) and then usually write the results back into one of the primary stores so that the online services can use them. Examples here are indexing new stories, calculating the promotion algorithm and running analytics jobs over site activity.
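The queue-and-worker flow can be sketched in a few lines. This is an illustration only: `queue.Queue` stands in for a RabbitMQ queue so the example runs without a broker, and the message format, event names and handlers are assumptions, not Digg's actual schema.

```python
import json
import queue

events = queue.Queue()  # in-process stand-in for a RabbitMQ queue


def publish(event_type, **payload):
    """Online side: enqueue a JSON message and return immediately."""
    events.put(json.dumps({"type": event_type, "payload": payload}))


digg_counts = {}  # stand-in for a primary store
indexed = []      # stand-in for the search index


def on_story_dugg(payload):
    story_id = payload["story_id"]
    digg_counts[story_id] = digg_counts.get(story_id, 0) + 1


def on_story_submitted(payload):
    indexed.append(payload["story_id"])


# Hypothetical routing table: message type -> job worker function.
HANDLERS = {
    "story.dugg": on_story_dugg,
    "story.submitted": on_story_submitted,
}


def drain(q):
    """Offline side: run the matching handler for each queued message."""
    handled = 0
    while True:
        try:
            raw = q.get_nowait()
        except queue.Empty:
            return handled
        message = json.loads(raw)
        handler = HANDLERS.get(message["type"])
        if handler is not None:  # unknown message types are skipped
            handler(message["payload"])
            handled += 1


publish("story.submitted", story_id=7)
publish("story.dugg", story_id=7, user_id=1)
publish("story.dugg", story_id=7, user_id=2)
processed = drain(events)
print(processed, digg_counts, indexed)
```

The online request only pays for the enqueue; the counting and indexing happen later, and their results land in stores the online services can read.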
Data Stores
Digg stores data in multiple types of systems depending on the type of data and the access patterns, and also for historical reasons in some cases :)
Cassandra: The primary store for "Object-like" access patterns for such things as Items (stories), Users, Diggs and the indexes that surround them. Since the Cassandra 0.6 version we use does not support secondary indexes, these are computed by application logic and stored here. This allows the services to look up, for example, a user by their username or email address rather than the user ID. We use it via the Python Lazyboy wrapper.
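To make the application-maintained indexes concrete, here is a minimal sketch in which plain dictionaries stand in for Cassandra column families. The layout and function names are assumptions for illustration, and this is not the Lazyboy API. The idea is that every write to the user record also writes the lookup rows that a secondary index would otherwise provide.

```python
users_by_id = {}       # stand-in "column family" keyed by user ID
user_id_by_name = {}   # index rows: username -> user ID
user_id_by_email = {}  # index rows: email -> user ID


def save_user(user_id, username, email):
    """Write the user row and keep both index rows in step with it."""
    old = users_by_id.get(user_id)
    if old is not None:
        # Remove stale index entries when a username or email changes.
        user_id_by_name.pop(old["username"].lower(), None)
        user_id_by_email.pop(old["email"].lower(), None)
    users_by_id[user_id] = {"username": username, "email": email}
    user_id_by_name[username.lower()] = user_id
    user_id_by_email[email.lower()] = user_id


def find_user_by_name(username):
    user_id = user_id_by_name.get(username.lower())
    return None if user_id is None else users_by_id[user_id]


def find_user_by_email(email):
    user_id = user_id_by_email.get(email.lower())
    return None if user_id is None else users_by_id[user_id]


save_user(1, "dajobe", "dave@example.com")
print(find_user_by_name("Dajobe"))
save_user(1, "dave", "dave@example.com")  # rename: old index row is dropped
print(find_user_by_name("dajobe"), find_user_by_email("dave@example.com"))
```

In a real cluster the row write and the index writes are separate operations, so the application also has to cope with one succeeding and the other failing; the sketch ignores that.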
HDFS: Logs from site and API events, user activity. Data source and destination for batch jobs run with Map-Reduce and Hive in Hadoop. Big Data and Big Compute!
MogileFS: Stores image binaries for user icons, screenshots and other static assets. This is the backend store for the CDN origin servers which are an aspect of the Front End systems and can be fronted by different CDN vendors.
MySQL: This is mainly the current store for the story promotion algorithm and calculations, because it requires lots of JOIN-heavy operations, which are not a natural fit for the other data stores at this time. However... HBase looks interesting.
Redis: The primary store for the personalized news data because it needs to be different for every user and quick to access and update. We use Redis to provide the Digg Streaming API and also for the real time view and click counts since it provides super low latency as a memory-based data storage system.
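One way per-user news data and live counters can be kept in a memory store is sketched below. It assumes a fan-out-on-write design, which the post does not confirm, and uses in-process Python structures rather than a Redis client: the capped `deque` plays the role of a Redis list maintained with LPUSH and LTRIM, and the integer counter plays the role of INCR. All names and the feed length are hypothetical.

```python
from collections import defaultdict, deque

FEED_LENGTH = 3  # illustrative cap, kept small for the demo

followers = defaultdict(set)  # user -> set of users following them
feeds = defaultdict(lambda: deque(maxlen=FEED_LENGTH))  # user -> story IDs, newest first
view_counts = defaultdict(int)  # story ID -> number of views


def follow(follower, followed):
    followers[followed].add(follower)


def fan_out(submitter, story_id):
    """Push a new story onto the feed of everyone following the submitter."""
    for user in followers[submitter]:
        feeds[user].appendleft(story_id)  # oldest entry falls off when full


def record_view(story_id):
    """Increment and return the running view count for a story."""
    view_counts[story_id] += 1
    return view_counts[story_id]


follow("alice", "bob")
for story_id in [1, 2, 3, 4]:
    fan_out("bob", story_id)
record_view(4)
record_view(4)
print(list(feeds["alice"]), view_counts[4])
```

Each user gets their own small, bounded list, so reading a personalized page is a single lookup, and the writes are cheap enough to happen on every submission.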
SOLR: Used as the search index for text queries along with some structured fields like date, topic.
Scribe: the log collecting service. Although this is a primary store, the logs are rotated out of this system regularly and summaries written to HDFS.
Operating System and Configuration
Digg runs on Debian stable based GNU/Linux servers, which we configure with Clusto, Puppet and a configuration system built over ZooKeeper.
More
In future blog posts we will describe in more detail some of the systems outlined here. Watch this space!
If you have feedback on this post or suggestions on what we should write about, please let us know. This post was written by Dave (@dajobe).
This could be considered a followup to the How Digg Works post from 2008 describing the earlier architecture.
Reposted from: http://www.cnblogs.com/wufawei/archive/2012/03/25/2416853