Tuesday, June 14, 2016

Yes, Blockchain Is Going To Change The World

A recent Forbes article, "Blockchain Is Not Going To Change The World", motivated me to write about why I think blockchains will change the world, perhaps for a reason that's not commonly discussed.

What the article gets right
The article is quite right to point out that a blockchain-based ledger doesn't actually solve the problems people have. For example, a ledger based on a private blockchain is really nothing more than a poor substitute for technologies the banks already have. Public blockchains like Bitcoin are a bit more interesting, but I agree that they are unlikely to "change the world", at least on their own.

What the article gets wrong
The article doesn't address Smart Contracts, which are much more interesting than boring ledgers. The ledger is boring because we already have ledgers. Smart Contracts are interesting because we don't have anything quite like them right now.

Smart Contracts' little brother did change the world
Before we talk about Smart Contracts we should talk about APIs, because in my opinion Smart Contracts are essentially APIs on steroids. So what is an API, and how did it change the world? The API itself is as ancient as programming languages. But we came to understand its true importance only recently: the ability to package services into building blocks and combine them to create new services. Amazon is a famous pioneer in this, and it went on to create an empire powered by APIs.

The key concept of an API is very simple; all it does is break up your services and make them separately accessible to others through programming. There is nothing mysterious about it. But its transformative effect was enormous. 15 years ago, if I wanted to create a new service (let's say an e-commerce site), I had to do a lot by myself. Getting servers, configuring the network, figuring out how to analyse customer behaviour, taking payments... it was a formidable prospect.

Now, I can do most of these things really easily thanks to APIs. I call Amazon's API to get servers and a network in an instant, Mixpanel's API to analyse customer behaviour, Stripe's API for payments, and so forth. Creating a new e-commerce site is suddenly pretty trivial. We have APIs to thank for the recent explosion in the app economy and the startup economy, and a bunch of other stuff.

Why are Smart Contracts better than APIs?
While APIs are pretty good at packaging services, they're still rather crude. For starters, API designers have to anticipate how their services might get used and define what the API looks like. Good API designs are "generic", meaning that they can support many different, often unanticipated use cases, but a significant restriction is still there.

For example, imagine you want to start a new business as a "housing consultant". You give advice to prospective house buyers, and if they end up buying the house, you get a commission. That's certainly possible with just APIs. You track which houses you recommended and periodically check, through an API provided by the government, whether the clients ended up owning the house. If they did, you charge them through a payment API. This is pretty easy, but you still need to have servers. You still need to manage the contracts with your clients. What if, instead, you could just write up a contract that executes itself? You tell your client, "I have a house that's just right for you. If you sign this contract, I'll tell you which house. If you end up owning the house within 5 years, I'll charge you $10". Once the contract is signed you can forget about it. If the condition is eventually met, payment is taken automatically.

Maybe one of your clients wants to amend the contract; they want to change the condition to "if I end up owning the house for more than 3 years within the next 5 years". No problem, you can just change it a little bit and sign it. No need to worry about making notes, changing software etc. If it says so in the contract, you can be quite sure that it will be executed accordingly.
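To make the idea a bit more concrete, here is a minimal Python sketch of such a self-executing contract. Everything in it is hypothetical: `owns_house` and `charge` stand in for whatever trusted land-registry and payment endpoints a real smart-contract platform would expose, and a real contract would run on a blockchain rather than in an ordinary process like this one.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable

# Hypothetical endpoint signatures (assumptions, not a real platform API).
OwnershipCheck = Callable[[str, str, date], bool]  # (client, house, day) -> owns it?
Charge = Callable[[str, int], None]                # (client, amount in dollars)


@dataclass
class CommissionContract:
    client: str
    house: str
    signed_on: date
    fee_dollars: int = 10
    term_days: int = 5 * 365  # "within 5 years", approximated in days
    settled: bool = False

    def evaluate(self, today: date, owns_house: OwnershipCheck, charge: Charge) -> bool:
        """Take the fee once if the client owns the house within the term.

        Returns True only on the evaluation that actually charges the client.
        """
        expired = today > self.signed_on + timedelta(days=self.term_days)
        if self.settled or expired:
            return False
        if not owns_house(self.client, self.house, today):
            return False
        charge(self.client, self.fee_dollars)
        self.settled = True
        return True
```

Amending the contract, as in the "3 years within the next 5 years" example, would then amount to swapping in a different condition rather than changing any servers or bookkeeping.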

Let's say you now want to expand your business into advising people on how to save energy; for every dollar saved compared to last year's bill, you get 1%. This is again certainly possible with APIs, but much easier if energy companies offered interaction through Smart Contracts instead.

Let's say you now have a bunch of these smart contracts that generate some income from various people, for various reasons, but you find out you are likely to die within a year. No problem, you can just securitise all these contracts and sell them. That in itself is nothing new; with enough effort you can do this today. But it's certainly impractical for an individual to do it. With Smart Contracts this will become practical, just like APIs made it possible for an individual to start an e-commerce site overnight.

Where does the trust come from?
"Well, how do contracts know when to execute themselves?" you might ask. The easiest way is to use trusted "endpoints", like a "land registry smart contract" or a "land registry API" provided by a trustworthy party. For example, we just tell the contract to check an endpoint provided by a government and to execute itself if it detects the condition. That doesn't sound very magical, but we still gain a lot by being able to combine such endpoints in a manner that was not possible before.

There are also some interesting ideas for generating trust in a more unconventional manner. For example, would you trust Wikipedia as a trustworthy endpoint in your smart contract? Maybe not; somebody could inject false information into articles and trigger contracts fraudulently.

What if I make some adjustments? Say I pick a very popular Wikipedia article (like POTUS), and we say that the information has to be present in the article 80% of the time for one year before the contract is triggered. We are almost getting there; it'll be pretty hard to pull off this fraud. Extend this concept further and you can crowdsource "truth".
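The "80% of the time for one year" rule could be sketched roughly like this in Python. This is only an illustration under my own assumptions: `observations` is a list of (day, was the claim present?) samples produced by some hypothetical crawler, every sample carries equal weight, and the 7-day slack at the start of the window is an arbitrary choice.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple


def claim_is_trusted(
    observations: Iterable[Tuple[date, bool]],
    today: date,
    window_days: int = 365,
    min_presence: float = 0.8,
) -> bool:
    """True if the claim was present in at least `min_presence` of the samples
    taken during the last `window_days` days."""
    start = today - timedelta(days=window_days)
    window = [(day, present) for day, present in observations if start < day <= today]
    if not window:
        return False
    # The samples must reach back to (roughly) the start of the window, so a
    # claim that was injected last week cannot trigger the contract.
    earliest = min(day for day, _ in window)
    if (earliest - start).days > 7:
        return False
    presence = sum(1 for _, present in window if present) / len(window)
    return presence >= min_presence
```

A fraudster would now have to keep the false information in a heavily watched article for most of a year, which is the whole point.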

Sunday, October 6, 2013

Why I don't like Nutmeg

Nutmeg is a startup that provides investment management for consumers. Basically, when you sign up with them, they'll ask you some questions to figure out your risk tolerance, timeframes for investments, how much you want to pay in each month etc. They will then go and buy investments (stocks, bonds, REITs etc.) for you and manage your portfolio on your behalf (i.e. they will buy/sell your assets at their discretion).

They charge up to 1.0% for this asset management piece, depending on the amount of assets you have with them etc. This doesn't include the costs that the assets themselves incur, which they say are about 0.3% on average.

So why don't I like it?

1.3% is WAY too expensive

Before I can explain why 1.3% is too expensive, I have to quickly touch on the two different investment strategies, Passive and Active.

An Active strategy is what most of us usually think of as an investment strategy. In an active strategy, you'd sit down and think about what to buy. For example, you might go "Now that Ballmer is gone, I bet Microsoft will be way more successful in the future", and thus buy Microsoft stocks. Or you might go "I think Chinese stocks are way overvalued these days" and sell some Chinese stocks.

The key here is that you are trying to predict the future and/or outsmart other investors. You predicted that Microsoft will do well and that's why you are buying their stocks. You think other investors are overvaluing Chinese stocks, and that's why you are selling them.

This is obviously hard work and requires ample experience, and this is where companies like Nutmeg come in. They'll do this work for you, and in turn you pay them the 1.3% fee.

In the world of Active investment strategy, 1.3% isn't necessarily expensive. The problem, however, is that if you are an average consumer, you don't need an Active strategy. There are people who disagree with this (usually fund managers, stock brokers etc.), but the evidence is overwhelming. A recent study by S&P, for example, which examined 10,000 actively managed funds, found that 82% of them underperform passive funds (source).

Warren Buffett, the legendary investor, has also stated:

[...] most investors are better off putting their money in low-cost index funds [...] very low-cost index is going to beat a majority of the amateur-managed money or professionally-managed money [...] gross performance [of actively managed funds] may be reasonably decent, but the fees will eat up a significant percentage of the returns 

And he put his money where his mouth is. He has specified that, after his death, his personal portfolio be invested in low-cost index funds (source).

Really, there is very little reason for an average consumer to go with active investment.

Believe me, it's not hard to do investment yourself

A couple of people have told me "Well, but I don't want to spend any time thinking about my investments. I'd rather just pay 1.3%". They probably don't realize how big the impact of these costs is in the long term.

[...] illustrates how strongly costs can affect long-term portfolio growth. It depicts the impact of expenses over a 30-year horizon in which a hypothetical portfolio with a starting value of $100,000 grows an average of 6% annually. In the low-cost scenario, the investor pays 0.25% of assets every year, whereas in the high-cost scenario, the investor pays 0.90%, or the approximate asset-weighted average expense ratio for U.S. stock funds [...] The potential impact on the portfolio balances over three decades is striking—a difference of almost $100,000 (coincidentally, the portfolio's starting value) between the low-cost and high-cost scenarios

Yep, they call 0.90% "high-cost scenario". 1.3% is really high, and it will cost you a lot.
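If you want to play with the numbers yourself, here is a small Python sketch. It assumes a deliberately simplified model in which the yearly cost is subtracted from the 6% gross return before compounding; that is my assumption, not necessarily the exact method behind the quoted figures, so treat the output as a ballpark.

```python
def final_balance(start: float, gross_return: float, annual_cost: float, years: int) -> float:
    """Compound `start` for `years`, deducting `annual_cost` from each year's return."""
    balance = start
    for _ in range(years):
        balance *= 1 + gross_return - annual_cost
    return balance


# Scenarios from the quote (0.25% and 0.90%) plus the 1.3% discussed in this post.
low = final_balance(100_000, 0.06, 0.0025, 30)
high = final_balance(100_000, 0.06, 0.0090, 30)
nutmeg = final_balance(100_000, 0.06, 0.0130, 30)

for label, value in [("0.25%", low), ("0.90%", high), ("1.30%", nutmeg)]:
    print(f"cost {label}: ${value:,.0f}")
print(f"low-cost vs high-cost gap: ${low - high:,.0f}")
```

Under this simplified model the gap between the 0.25% and 0.90% scenarios comes out at roughly $90,000, in line with the "almost $100,000" in the quote, and the gap to 1.3% is larger still.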

Asset allocation (i.e. deciding what to buy, what to sell) is very simple in Passive investment. Vanguard, an investment firm that pioneered Passive investment, has generally good resources to start with, for example this guide. Here is another good starting point from Forbes. These kinds of dead simple portfolios are the ones that "consistently outperform most actively managed funds", as noted earlier. Once you have decided on an asset allocation based on the return you want, the only thing left really is to find the cheapest mutual fund/ETF that tracks that asset class and buy it.

If you are still too lazy to choose your own asset allocation, you can buy "balanced funds", which are mutual funds with pre-made, low-cost index asset allocation. Here is a collection from Vanguard. Note the expense ratio. The lowest one has 0.1% while the highest one has 0.18%. Yeah, compare that with 1.3%.

These "balanced fund"s have in general a very simple asset allocation, and hence it's easy to just copy the allocation. In fact, there is even a tool for that online. This will give you an even lower cost, but the downside is that you'll have to do asset rebalancing yourself whereas "balanced fund"s will do that automatically for you.

So yeah, that's why I don't like Nutmeg. The world needs more passive investment, not active.

Saturday, August 24, 2013

A guide to landing a job at an overseas startup, for Japanese software engineers (job hunting edition)


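  • Look beyond the US
    • Software engineering jobs are concentrated in Silicon Valley, so it's tempting to look only there, but there are positions in other countries too. I haven't compared actual numbers, but the UK, Canada, Australia, Ireland, Singapore, Germany, the Netherlands, Sweden, Switzerland and Luxembourg seemed to have a lot of openings.
    • There are also jobs in developing countries and regions such as China, India and Eastern Europe, but salaries are low, so you often can't save enough, which hurts when you need money in Japan or end up moving back. Honestly, I can't really recommend them. That said, some startups there shine exceptionally bright, and going to one of those could well be worth it.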

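  • Look for startups targeting the Japanese market
    • This pattern is ultimately the most likely to succeed. The company often intends to hire from Japan from the outset, so they will normally sponsor your visa and will often fly you in for interviews at their expense (which lets you interview at other companies on the same trip). The technical bar is also lower. The path to senior management opens up more easily too, so it's good for your career.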

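  • But include positions with no connection to Japan too
    • There aren't that many openings with a Japan connection, so insisting on them reduces your chances. Positions with no Japan connection are perfectly attainable if you have a technical selling point, so you should include them.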

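  • Be careful about local hires at Japanese companies
    • Getting a job at a Japanese company has the lowest hurdle, but unfortunately the reality is that it rarely works out well. I spent a period seconded from the headquarters of a Japanese company to an overseas subsidiary, and locally hired Japanese staff generally had low pay, almost no chance of promotion and, in most cases, no job security either. On top of that, the way of working is often no different from a company in Japan, so despite having gone to the trouble of getting a job abroad you end up in much the same position as a temp worker at a company in Japan. You are obviously 100 times better off joining the headquarters of a Japanese company and being sent abroad.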

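  • Approach recruitment agents too
    • Many recruitment agents won't deal with you unless you are already in the country, but they will show interest if it's a position connected to Japan or one that is hard to fill. There are lots of them, so contact lots of them. They give you information about companies and advice on interviews, so they are a welcome resource.
    • However, don't forget that an agent is on the side of the money, not on your side! (lol) They quite often try to shoehorn you into a position that isn't much of a match, and they normally play up the good points of a company while leaving out the bad. It depends on the country's customs, but unlike in Japan, some agents make their money by placing candidates cheaply, so they may push hard to negotiate your salary down. However kind the agent is, it's better to get into direct negotiation with the hiring company early.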

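  • Useful tools for an international job search
    • Indeed.com
      • A meta search site for job listings. In a word, it's the Google of job listings: the strongest there is. There are country-specific versions.
    • StackOverflow Career
      • Has many listings from high-quality startups. If you have built up reputation on StackOverflow you may also get scouted. This site is how I found my current company.
    • Monster.com
      • A job site mainly for engineers. It also has country-specific versions. My personal impression was that many of the positions weren't that appealing, but maybe that's just me. It's big.
    • AngelList
      • Mainly startup listings. Unlike StackOverflow Career, posting an ad is free, so you also find listings from small startups that have just started and have no money.
    • Glassdoor
      • Mainly useful for finding out salary levels. For young startups there are rarely reviews of the company itself, but you can see salary levels for people in similar roles with similar experience, which is very useful.
    • Duedil
      • A site that gathers public information about companies and shows it to you. As long as the company is registered you can see at least some information (year founded, shareholder structure, capital structure and so on), which is handy.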

Wednesday, August 21, 2013

A guide to landing a job at an overseas startup, for Japanese software engineers (preparation edition)


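  • Get a relevant degree

    • Having a relevant degree is an overwhelming advantage for getting a visa and for job hunting. A master's or higher in computer science is ideal, but a bachelor's, or a degree in a quantitative field (mathematics, electrical engineering, physics and so on), is also plenty useful. That said, work experience can sometimes substitute for a degree, so it's not impossible without a relevant degree (to tell the truth, my own degree was in biology).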

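  • Develop a technology with a high barrier to entry as your selling point

    • From the company's or the country's point of view, hiring someone from abroad needs a substantial reason. In short, these "reasons" fall into two patterns: ① people can't be found because the pay is low or the work is tough, or ② people with the required skills can't be found. Route ① is miserable in many ways, so you should absolutely aim for route ②.
    • The easiest approach is to sell a technology that is both technically difficult and niche. For example, engineers in very niche fields such as biometrics, speech recognition, air traffic control and cryptography are hard to find, so as long as there is an opening, both the offer and the visa come easily. However, if you go too niche the openings get scarce, so I think "moderately niche" fields such as distributed processing, concurrency, functional languages and front-office finance systems offer the best odds. The key is that difficult means a high barrier to entry. Selling points with a low barrier, such as agile, web apps and cloud, can easily be found domestically, so getting hired from abroad on them tends to be hard.
    • I especially recommend analytics fields such as machine learning, artificial intelligence and statistics. The barrier here is high and demand is exploding, so you will be sought after not just by startups but by operating companies in every sector, consulting firms and so on, and visas and offers are very easy to get. Salaries are excellent too and the career path to management opens up easily, so it's all upside. If you have the option of going this way, choose it without hesitation.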

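  • Build up a public track record

    • When changing jobs from abroad, the time you get to meet and talk is short, so objectively verifiable material carries more weight. There is a lot you can accumulate while living your everyday life in Japan, so keep at it steadily and one day it will pay off. In descending order of effectiveness:
      • Become a committer on an OSS project
        • If you can say "I did this part of this famous project, and here is the code", there is no stronger pitch. Even if it's not a famous project, or you weren't involved in the core, any contribution at all is a good enough selling point.
      • Release OSS
        • Even on your own, if you have put some OSS or a service out into the world, it's easier for the employer to judge your ability, so it makes a strong pitch. It of course also proves that you are motivated. Any medium will do, but you can't go wrong with GitHub.
      • Write a blog in English
        • It doesn't matter if nobody reads it. Just writing down your daily discoveries and thoughts is a plus. For the employer it also serves as a check on your English. Switching to English later is a hassle, so I recommend using an English-language blog platform (Blogger, for example) from the start.
      • Collect reputation on StackOverflow
        • StackOverflow is a Q&A site for programmers. It is very well known in the industry, counts 2.3 million users and has many famous programmers active on it. A high reputation there is reassuring, and from your answer history people can gauge your interests, your thinking about development and your communication skills. As a rough guide, a reputation of a few thousand is ideal.
      • Collect recommendations on LinkedIn
        • When hiring an engineer from a non-English-speaking country, recommendations along the lines of "communicated well in English" or "did well in a Western-style culture" reassure the employer. There's no need to collect them desperately, but if you can think of someone who might write one, it can't hurt to ask.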

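  • Polish your English

    • People say in all sorts of places that "you'll be fine without English as long as you have the will", but I don't agree at all. It's true that the English required of engineers is lower than for sales or marketing, but you have to be able to present your track record and skills in the interview to begin with, and you need the interviewer to think "this person can communicate well enough to get by in our company". Also, communication skills (in practice ≈ English skills) become more important as you move up. There's no need to rush, but it's clear that you should work on improving your English.
    • Incidentally, I personally don't think studying English grammar is very useful. It's rare that you fail to get your point across because of small grammar mistakes. Far more often the problem is vocabulary, understanding how words are used, or pronunciation. So I suspect you're better off with study that strengthens those.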

Saturday, August 17, 2013





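  1. Your ability to save drops, which hurts when you need money in Japan or are forced to move back
  2. It's hard to set up a good environment for raising children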












Saturday, July 6, 2013

A simplistic RedShift troubleshooting guide

Are you trying out RedShift, but not quite getting what you want? Confused why there are no INDEX statements? Here is a very quick troubleshooting guide. Disclaimer: RedShift is still new and I haven't used it for that long yet. If you find something inaccurate in this article, please let me know. 

Table of contents:

  • Are you using it for something it wasn't built for?
  • Common design mistakes
  • Common confusions
  • Stuff that you should do
  • Miscellaneous Notes

Are you using it for something it wasn't built for?
This is the most common way to get problems. If you answer YES to the following questions, you might be using RedShift for the WRONG problem.
  1. I'm using it for something non-analytics
    • Yellow flag. RedShift isn't a replacement for "normal" relational DBs (aka "OLTP" DBs) like MySQL, Oracle etc. If you are directly writing to RedShift from your business applications, it's probably wrong.
  2. I don't do aggregation (SUM, COUNT etc.) or TopN (ORDER BY/LIMIT) queries
    • RedShift is primarily good at scanning a large number of rows. If you want to quickly fetch a small number of records, you might get better results with other DBs that index these attributes.
  3. I'm looking for very short response time for reads (e.g. < 1 sec)
    • RedShift is not optimised for "light" queries and tends to have longer response times compared to other DBs in this domain.
  4. I'm looking for very short response time for writes
    • RedShift's latency & throughput for trickle loading was very poor in our experiments. At the moment I recommend periodically performing a bulk load. (However, AWS claims it does do trickle loading well; let me know of your experiences!)
  5. I want to run custom functions on my data
    • At the moment, you can't run arbitrary operations on your data in RedShift. If you want to scrub data, parse text, apply models etc. using non-SQL languages like Java or Python, you can't use RedShift for that (hint: use EMR for that stage).

Common design mistakes

  1. Writing directly from applications
    • Directly writing from applications incurs high write latency. The most common way of loading data to RedShift is to periodically export your data to S3 and then use the COPY command.  
  2. Frequently modifying rows
    • Addition of rows can be handled much better in RedShift. You should consider if representing your modifications as new rows (i.e. insert 'deleted' rows instead of actually deleting the original row) will work better.
  3. Choosing a bad DISTKEY (we'll cover this in detail later)
    • The most expensive queries in RedShift are those that do a large re-distribution of data. This occurs when you join tables that use different DISTKEYs.
    • Another common mistake is to choose a DISTKEY that causes a "data skew".
  4. Overly avoiding JOINs
    • Joining large tables isn't something to be scared of if both tables use the join key as DISTKEY. If it makes things easier, don't be afraid of it.

Common confusions

  1. Why do I have inconsistent data in my tables?! I had defined primary key/foreign key/unique constraints!
    • RedShift uses them to optimise queries, but it does not enforce them. You need to enforce them yourself in the ETL process.
  2. OK, how do I create indexes?
    • RedShift doesn't have the usual INDEXes you'll find in other RDBMS.
    • You have knobs to turn, though. DISTKEY and SORTKEY can be thought of as indexes that you fiddle with.

Stuff that you should do

  1. Consider choosing DISTKEY
    • What is "DISTKEY" anyways?
      • DISTKEY essentially decides which row goes to which node. For example, if you declare "user_id" as DISTKEY, RedShift will do node_id = hash(user_id) % num_nodes to choose the node to store that row. Well, it's not THAT simple, but you get the idea.
    • Why does it matter?
      • DISTKEY primarily matters when you do a join. Let's say a SQL statement SELECT * FROM User INNER JOIN Post ON (User.UserId = Post.UserId) WHERE Post.Type = 1 is issued. If User and Post both used UserId as DISTKEY, a RedShift node can just take the allocated shard, join them, filter them and send the (much smaller) contribution over the wire to be combined. However, if User was distributed by UserId and Post was distributed by ArticleId, Posts that belong to Users on a node will be on other nodes. Therefore the nodes have to ship the entire shard over the network to perform the join, which is expensive.
    • What should I do?
      • If a table is large and you anticipate a join with another large table, then consider choosing the key that will be used for the join as the DISTKEY. Otherwise, don't declare a DISTKEY (RedShift will distribute the rows evenly).
    • What is "data skew"?
      • Data skew is when data concentrates on a small number of nodes due to a badly chosen DISTKEY. Imagine you have a huge user base which is predominantly located in the US. If you use "country_code" as DISTKEY, most of the data will end up on one node because most users will have the same country_code "US". This means that this one node will do most of the work while other nodes will remain idle, which is inefficient. Therefore, it's important to choose a DISTKEY that will result in an even(-ish) distribution among the nodes.
  2. Consider "series table" to deal with writes to your tables
    • A big part of RedShift's performance comes from the optimised data storage. When you newly load data into a table, its storage is neatly optimised. As you make modifications to the table, you start to disrupt this optimised state, a bit like "fragmenting your hard disk". That's why you have to perform ANALYZE/VACUUM from time to time to correct this (a bit like doing a "defrag"). This can however become expensive at some point. This is where "series tables" help. For example, you can create a "daily" table for each day and use a UNION ALL view to combine these tables. This way, you can perform ANALYZE/VACUUM only on the latest table as you load data & simply drop old tables to expire data, rather than having to delete rows from a huge table and optimise it afterwards. This is also recommended in the RedShift documentation.
  3. Use SORTKEY
    • SORTKEY essentially defines how the data will be sorted in the storage.
    • This feature is useful for limiting the amount of data that has to be scanned. For example, if I have a large table full of newspaper articles spanning a century and want to find articles published between 1980 and 1985 that mention "Tiger", it's useful to have the articles sorted by published_date in storage, because that way I can limit the scan to the blocks that contain these dates.
    • They are also useful for joining if the key is also the DISTKEY because the query planner can skip a lot of work.
    • You *can* specify multiple SORTKEYs. When you specify SORTKEY(a, b), the data is effectively sorted as if with "ORDER BY (a, b)". If the cardinality of a is high enough, filtering by a is very effective, but having a second SORTKEY makes little sense, and vice versa. Therefore the utility of setting multiple SORTKEYs is more difficult to judge. Start with a single SORTKEY and see how it goes.
  4. Consider replicating the table as a "JOIN INDEX" à la Teradata if you have more than one column you want to choose as DISTKEY
    • You have more than one column you'd want to elect as DISTKEY but RedShift only lets you choose one. In such cases, you can simply create a replicated table that only differs in which key is declared as DISTKEY. This might seem like a poor idea, but it's essentially what Teradata (a similar technology to RedShift) does for its join index feature. You might be worried about maintaining the consistency between these tables, but because you are usually doing analytics & load data in bulk, it's usually not a problem.
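To get a feel for DISTKEY and data skew, here is a toy Python sketch of the node_id = hash(key) % num_nodes idea from above. It is only an illustration: the MD5-based hash, the `node_for` and `rows_per_node` helpers and the sample data are all my own assumptions, and RedShift's real distribution logic is not that simple.

```python
import hashlib
from collections import Counter
from typing import Hashable, Iterable


def node_for(key: Hashable, num_nodes: int) -> int:
    """Toy stand-in for node_id = hash(key) % num_nodes, using a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def rows_per_node(keys: Iterable[Hashable], num_nodes: int) -> Counter:
    """Count how many rows land on each node for a given DISTKEY column."""
    return Counter(node_for(key, num_nodes) for key in keys)


# 10,000 rows distributed by a unique user_id vs. by a country_code
# that is "US" for 90% of the rows (hypothetical data).
user_ids = list(range(10_000))
country_codes = ["US"] * 9_000 + ["JP"] * 500 + ["GB"] * 500

even = rows_per_node(user_ids, 4)
skewed = rows_per_node(country_codes, 4)
print("by user_id:     ", sorted(even.items()))
print("by country_code:", sorted(skewed.items()))
```

Distributing by user_id spreads the rows across all four nodes, while distributing by country_code piles at least 9,000 of the 10,000 rows onto whichever node "US" hashes to.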

Miscellaneous Notes

  1. Don't worry if your CPU utilisation is high
    • Part of what makes these technologies powerful is the ability to exploit HW through efficient parallel processing, which means high CPU utilisation (spikes). Don't think you need to add nodes just because CPU utilisation sometimes hits 100%.
    • Don't focus on CPU and overlook other signs, like high network usage (which may indicate data re-distribution). 
  2. Use WLM to counter resource hogging
    • When queries are issued concurrently, resource hogging can become a problem. For example, if somebody issues 10 queries that take 1 hour each, another guy with a 5 min query can wait for a long time before he can get his query done. To prevent this kind of problem, consider using WLM.

Happy RedShift development! :)

Thursday, February 2, 2012





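CA and CP systems are practically indistinguishable. In other words, the CAP theorem gives the impression that there are three kinds of systems (CA, CP and AP), but in practice there are only two: CA/CP and AP.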


  1. A CA system:
    • is Consistent and Available as long as no partitioning occurs (i.e. normally)
    • stops functioning when partitioning occurs

  2. A CP system:
    • is normally Consistent
    • "Available" means "able to respond as long as the node has not failed", so in fact a CP system is also Available normally (i.e. when no partitioning is happening)
    • stops being Available when partitioning occurs, but stays Consistent


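  Consider a relational DB that replicates synchronously between two nodes. Suppose it is configured so that no write completes until it has been written to the disks of both nodes. If communication between the two nodes is lost, the system can't write to both disks and stops functioning. This is a CA system.  Now suppose this gives too little availability for the system as a whole (the system stops if either node breaks or if communication is lost). So let's agree that "if communication is lost, the node with the lower IP address continues operating using only its own disk". That way, even if communication is lost, operation can continue as long as the node with the lower IP address is alive.
  What if you wanted to keep Availability no matter what? Then replication has to be asynchronous. Each DB queues up the changes it receives and passes them to the other once communication is restored.  In the meantime the contents of the two DBs differ, so Consistency is not maintained.
  That is the explanation of Abadi's argument that "CA and CP systems are practically indistinguishable". From this reasoning Abadi proposes the "PAC" of PACELC, meaning "if P happens, do you choose A or C?"  Next, let's look at the remaining "ELC". To explain up front, ELC stands for else Latency xor Consistency. Written out starting from PAC, it looks like this: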
if P then Availability xor Consistency else Latency xor Consistency
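If a network partition occurs, do you choose Availability or Consistency? And when there is no network partition, do you choose Latency or Consistency?
  Let's wrap up with a few real examples. What kind of system is a PCEC system (if Partition then Consistency, else Consistency)? One example is the relational DB that has come up several times, where two nodes replicate synchronously. When a network partition occurs, this system ends up with nodes that are alive but can't process requests (if P then loss of Availability). Even normally, a request doesn't complete until the write has completed on both nodes, so latency is poor too (no Latency in the else case). In exchange, Consistency is maintained normally (else Consistency) and is still maintained when P occurs. That is: if P then Consistency (and not Availability), else Consistency (and not Latency), hence PCEC.
  What about a PAEL system? DNS, which also came up earlier, fits here. Normally, as long as a DNS server's own table is up to date, it responds without considering whether other DNS servers have been updated, so latency is good (else Latency). Even if communication between DNS servers is lost and they can't synchronise, each keeps responding, so it is if P then Availability. In exchange, the result can differ depending on which DNS server handles the request, so Consistency is lost.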
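The two choices can be written down as a tiny Python sketch. The enum and function names here are my own invention for illustration; the only thing taken from PACELC itself is the rule "if P then A xor C, else L xor C".

```python
from enum import Enum


class OnPartition(Enum):
    """What the system keeps when a network partition (P) occurs."""
    AVAILABILITY = "A"
    CONSISTENCY = "C"


class Otherwise(Enum):
    """What the system favours when there is no partition (E = else)."""
    LATENCY = "L"
    CONSISTENCY = "C"


def pacelc_label(on_partition: OnPartition, otherwise: Otherwise) -> str:
    """Build the four-letter PACELC label, e.g. 'PCEC' or 'PAEL'."""
    return f"P{on_partition.value}E{otherwise.value}"


# The two examples discussed in this post.
sync_replicated_rdb = pacelc_label(OnPartition.CONSISTENCY, Otherwise.CONSISTENCY)
dns = pacelc_label(OnPartition.AVAILABILITY, Otherwise.LATENCY)
print(sync_replicated_rdb, dns)
```

Because each choice is an either/or, there are only four possible labels: PAEL, PAEC, PCEL and PCEC.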