Paying Off Old Debts Series

DynamoDB

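DynamoDB is a managed NoSQL DBMS from AWS that supersedes the earlier SimpleDB. It is designed to be Scalable (able to handle very large volumes of data), Available (HA-ready, although cross-Region redundancy is of course still your own responsibility), and Fast (stable response latency in the single-digit milliseconds).

To achieve these goals it drops SimpleDB's index-everything approach and moves closer to a key-value store, while LSIs and the later-released GSIs give developers more flexibility. Because it is NoSQL, transactions are reduced to atomic operations on a single document, and joins are not supported. RDBMS users may need to relearn quite a bit, because table design, query performance, and performance tuning are all completely different.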

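Data Storage and Primary Key

Many NoSQL DBMSs store data in a JSON-like form, but DynamoDB explicitly states that it uses its own storage format and treats JSON only as an input/output format. Even so, attribute names are stored with every document and take up storage space, so short but meaningful names help both performance and cost.

DynamoDB attributes can be one of six types: String, Number, Binary (represented as Base64), String Set, Number Set, and Binary Set. The lack of sub-document support can be worked around by JSON-encoding the value yourself, storing it as a String, and decoding it on read. Note in particular that DynamoDB accepts sets rather than arrays, so values cannot repeat.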
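To make the workaround concrete, here is a minimal sketch in Python. The attribute names (`uid`, `profile`, `tags`) and the helper functions are made up for illustration; only the `S` / `SS` type descriptors follow DynamoDB's low-level item format.

```python
import json

def to_item(user_id, profile, tags):
    """Build a low-level style item using only String and String Set attributes."""
    return {
        "uid": {"S": user_id},
        # Nested data is JSON-encoded and stored as a plain String.
        "profile": {"S": json.dumps(profile, sort_keys=True)},
        # Sets cannot hold duplicates, so de-duplicate before sending.
        "tags": {"SS": sorted(set(tags))},
    }

def read_profile(item):
    """Decode the JSON string back into a nested structure."""
    return json.loads(item["profile"]["S"])

item = to_item("u1", {"name": "cliff", "langs": ["zh", "en"]}, ["aws", "nosql", "aws"])
```

The price of this trick is that DynamoDB sees `profile` as an opaque string, so it cannot index or partially update anything inside it.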

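When creating a DynamoDB table you must declare every index you need (Primary Key, LSI, and GSI), and they cannot be changed afterwards. Attributes that are not indexed can be of any type and can be updated to any other type; indexed attributes must be of a scalar type, that is String, Number, or Binary.

A DynamoDB Primary Key consists of a Hash Key and an optional Range Key; the former is used to shard data, the latter to retrieve and sort data. As in other DBMSs, the Primary Key guarantees uniqueness. The cardinality of the Hash Key determines how data is distributed, so a poor choice can lead to poor performance.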
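The effect of Hash Key cardinality is easy to see with a toy model. The sketch below is not how DynamoDB hashes internally (that function is not public); MD5 and the partition count of 4 are assumptions chosen purely for illustration.

```python
import hashlib
from collections import Counter

def partition_of(hash_key, partitions=4):
    """Map a hash key to a partition; MD5 is only a stand-in for illustration."""
    digest = hashlib.md5(hash_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % partitions

def spread(keys, partitions=4):
    """Count how many requests land on each partition."""
    return Counter(partition_of(k, partitions) for k in keys)

# High cardinality: 1000 distinct user ids.
good = spread(f"user-{i}" for i in range(1000))
# Low cardinality: every request uses one of two status values.
bad = spread(["active", "inactive"][i % 2] for i in range(1000))
```

With only two distinct key values, at most two partitions ever receive traffic no matter how many exist, while the per-user key spreads the load across all of them.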

Index

Local Secondary Index (LSI)

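Like the PK, an LSI lets you specify another Range Key alongside the Hash Key, and you can choose which attributes to project, that is, copy into the index entry. So that the original document can be retrieved from an index entry, the Hash Key (which is also the LSI's Hash Key, so it is always present) and the PK's Range Key are always projected.

In DynamoDB, indexes are queried through the Query API. When querying an LSI, because the LSI and the document live on the same machine, DynamoDB automatically performs an extra read to fetch the document if you request attributes that are not projected. This costs additional read throughput and latency, and should be strictly avoided in practice.

When modifying data, if the attribute belongs to the PK or is projected, the index entry must be updated too, which consumes additional write throughput. So work out your usage pattern in order to make trade-offs between indexes and projections.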

An LSI does not enforce uniqueness, so many items can share the same Hash Key and index Range Key.

Global Secondary Index (GSI)

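This is a more recently implemented feature. Users can specify whichever Hash and Range Keys they want, following much the same rules as the PK; you can think of it as DynamoDB creating a corresponding GSI table with its own items and maintaining it in the background. Like an LSI, a GSI does not enforce key uniqueness. Unlike an LSI, a GSI does not support automatically reading the document to fetch non-projected attributes, and a GSI has its own provisioned throughput.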
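To show how the PK, an LSI, and a GSI are declared together, here is a sketch of CreateTable parameters as a plain Python dict. The parameter names follow the CreateTable API as I understand it; the table, the index names, and the capacity numbers are hypothetical, and nothing here actually calls AWS.

```python
def create_table_request():
    """Parameters for a hypothetical GameScores table with one LSI and one GSI."""
    throughput = {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}
    return {
        "TableName": "GameScores",
        # Only attributes used in some key schema are declared here.
        "AttributeDefinitions": [
            {"AttributeName": "UserId", "AttributeType": "S"},
            {"AttributeName": "GameTitle", "AttributeType": "S"},
            {"AttributeName": "TopScore", "AttributeType": "N"},
        ],
        "KeySchema": [
            {"AttributeName": "UserId", "KeyType": "HASH"},
            {"AttributeName": "GameTitle", "KeyType": "RANGE"},
        ],
        "ProvisionedThroughput": throughput,
        # LSI: same Hash Key as the table, different Range Key.
        "LocalSecondaryIndexes": [{
            "IndexName": "UserTopScore",
            "KeySchema": [
                {"AttributeName": "UserId", "KeyType": "HASH"},
                {"AttributeName": "TopScore", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "INCLUDE", "NonKeyAttributes": ["Wins"]},
        }],
        # GSI: its own Hash / Range Key and its own throughput.
        "GlobalSecondaryIndexes": [{
            "IndexName": "GameTopScore",
            "KeySchema": [
                {"AttributeName": "GameTitle", "KeyType": "HASH"},
                {"AttributeName": "TopScore", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": dict(throughput),
        }],
    }

def lsi_shares_hash_key(request):
    """Check that every LSI reuses the table's Hash Key."""
    table_hash = request["KeySchema"][0]["AttributeName"]
    return all(
        index["KeySchema"][0]["AttributeName"] == table_hash
        for index in request["LocalSecondaryIndexes"]
    )

request = create_table_request()
```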

CRUD

DynamoDB provides several data manipulation APIs:

Write: PutItem, UpdateItem
Delete: DeleteItem
Read: Query, Scan

Create / Update

By default both PutItem and UpdateItem behave much like an upsert, but they differ in the details:

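PutItem behaves like a RESTful PUT: if an item with the same PK exists, the whole document is overwritten.
UpdateItem behaves like a RESTful PATCH: if an item with the same PK exists, only the attributes passed in are overwritten. It supports atomic document operations.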

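When using UpdateItem, you can specify an Action for each attribute update:

PUT: the default; upserts the attribute's value. If the document does not exist, it behaves the same as PutItem.
DELETE: if no new value is given for the attribute, the attribute is removed; otherwise it is a set difference (the value passed in must be a set of the same type).
ADD: if both the passed-in value and the existing attribute are sets of the same type, the result is their union. If the passed-in value is a number, it acts as an atomic counter; if the attribute does not exist yet, it starts from 0.
If the document does not exist, the manual says only Number and Number Set can be written to it; I have not tested whether this is accurate, but String Set / Binary Set ought to work the same way.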
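The three actions are easier to remember with a small local model. The function below is a toy re-implementation of the semantics described above on ordinary Python dicts and sets; it is not DynamoDB code, and `apply_update` and its `(action, value)` tuple format are my own invention.

```python
def apply_update(item, updates):
    """Return a copy of item after applying {attr: (action, value)} updates."""
    result = dict(item)
    for attr, (action, value) in updates.items():
        if action == "PUT":
            result[attr] = value
        elif action == "DELETE":
            if value is None:
                result.pop(attr, None)               # no value: drop the attribute
            elif attr in result:
                result[attr] = result[attr] - value  # set difference
        elif action == "ADD":
            if isinstance(value, (set, frozenset)):
                result[attr] = result.get(attr, set()) | value  # set union
            else:
                result[attr] = result.get(attr, 0) + value      # counter, from 0
        else:
            raise ValueError(f"unknown action: {action}")
    return result

item = {"uid": "u1", "tags": {"a", "b"}, "visits": 3, "note": "x"}
updated = apply_update(item, {
    "tags": ("DELETE", {"b"}),   # remove one member from the set
    "visits": ("ADD", 1),        # increment an existing counter
    "likes": ("ADD", 5),         # missing attribute starts from 0
    "note": ("DELETE", None),    # remove the attribute entirely
    "name": ("PUT", "cliff"),    # plain upsert
})
```

The model skips details such as type checking, so treat it as a memory aid rather than a specification.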

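When calling PutItem or UpdateItem you can pass the Expected parameter to perform a conditional update, which makes patterns such as MVCC and atomic counters possible. The old API checked conditions with the Value / Exists pair, but the current recommendation is to use a comparison operator together with AttributeValueList, which is more flexible.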
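As a sketch of the MVCC pattern, the function below builds UpdateItem parameters that only succeed if a version number is unchanged and bumps that number in the same write. The `version` attribute, the function name, and the sample key are hypothetical; the parameter shapes (`AttributeUpdates`, `Expected`, `ComparisonOperator`, `AttributeValueList`) follow the API style discussed in this post, and the dict is only built, never sent.

```python
def versioned_update(table, key, changes, expected_version):
    """Build UpdateItem parameters that only apply if `version` is unchanged."""
    updates = {
        name: {"Action": "PUT", "Value": value}
        for name, value in changes.items()
    }
    # Bump the version atomically as part of the same write.
    updates["version"] = {"Action": "ADD", "Value": {"N": "1"}}
    return {
        "TableName": table,
        "Key": key,
        "AttributeUpdates": updates,
        # The write is rejected if someone else changed the item first.
        "Expected": {
            "version": {
                "ComparisonOperator": "EQ",
                "AttributeValueList": [{"N": str(expected_version)}],
            }
        },
    }

request = versioned_update(
    "GameScores",
    {"UserId": {"S": "u1"}, "GameTitle": {"S": "Galaxy"}},
    {"TopScore": {"N": "9001"}},
    expected_version=7,
)
```

A caller would read the item, remember its version, and retry the whole read-modify-write cycle whenever the conditional write fails.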

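Conditional Update Metrics

A conditional update does not consume read capacity. According to the manual, however, even when the Expected condition is not met (the write fails), it will most likely still consume 1 write capacity unit.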

Delete

As with Create / Update, you can supply Expected to perform a conditional delete; a delete consumes at least 1 write capacity unit.

Read

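DynamoDB provides two APIs (Query and Scan) for reading data, in two flavors: eventual consistency and strong consistency. Only reads against an LSI or the PK can choose strong consistency, and an eventually consistent read consumes half the read throughput of a strongly consistent one.

Each query returns at most 1 MB of data. Beyond that, DynamoDB returns a LastEvaluatedKey so that the query can be resumed. Because transactions are not supported, though, an enumeration may miss newly added items, so keep that in mind when designing your workflow.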
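Resuming from LastEvaluatedKey boils down to a simple loop: pass the key back as ExclusiveStartKey until the response no longer contains one. The sketch below keeps the loop independent of any SDK by taking the page-fetching function as an argument; `fetch_all` and `fake_query` are illustrative names, and `fake_query` merely imitates a paged response.

```python
def fetch_all(fetch_page, params):
    """Yield every item, following LastEvaluatedKey until it is absent."""
    params = dict(params)
    while True:
        page = fetch_page(params)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return
        # Resume right after the last item of the previous page.
        params["ExclusiveStartKey"] = last_key

def fake_query(params):
    """Stand-in for a real Query call: serves 5 items, 2 per page."""
    data = [{"id": i} for i in range(5)]
    start = params.get("ExclusiveStartKey", {"id": -1})["id"] + 1
    chunk = data[start:start + 2]
    page = {"Items": chunk}
    if start + 2 < len(data):
        page["LastEvaluatedKey"] = chunk[-1]
    return page

items = list(fetch_all(fake_query, {"TableName": "demo"}))
```

Nothing in this loop protects against items being inserted while it runs, which is exactly the caveat mentioned above.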

Scan

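A table scan. If the table is large, the scan accesses the same Hash Key repeatedly as it reads through the table, producing an effect similar to poor cardinality. A parallel scan alleviates this problem and makes better use of resources.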
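A parallel scan splits the table into segments by giving each worker its own Segment number out of TotalSegments. The sketch below shows the shape of such a fan-out with a thread pool; the `scan` argument is whatever function actually issues the request, and `fake_scan` is a made-up stand-in so the example runs without AWS. Paging within each segment is left out to keep it short.

```python
from concurrent.futures import ThreadPoolExecutor

def segment_requests(table, total_segments):
    """One Scan request per segment; Segment is zero-based."""
    return [
        {"TableName": table, "Segment": n, "TotalSegments": total_segments}
        for n in range(total_segments)
    ]

def parallel_scan(scan, table, total_segments):
    """Run `scan` once per segment in parallel and merge the items."""
    requests = segment_requests(table, total_segments)
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        pages = list(pool.map(scan, requests))
    return [item for page in pages for item in page["Items"]]

def fake_scan(request):
    """Stand-in for a real Scan call: splits 0..9 across segments by modulo."""
    n, total = request["Segment"], request["TotalSegments"]
    return {"Items": [i for i in range(10) if i % total == n]}

items = parallel_scan(fake_scan, "demo", 4)
```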

Query

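Reads data through the PK, an LSI, or a GSI. Querying the PK gives you the document directly, along with the value of any attribute in it. If you query through a GSI or LSI and fetch only projected attributes, DynamoDB only needs to read the index, so the read capacity can also be aggregated across items.

Attributes and projections were already covered in the Index section, so I won't repeat them here.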

Select / AttributesToGet

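The former tells DynamoDB what to return, while the latter lists the attributes to fetch. The possible values of Select are:

ALL_ATTRIBUTES: returns all attributes; if the LSI does not project all attributes, the document is read automatically
ALL_PROJECTED_ATTRIBUTES: returns all projected attributes
COUNT: returns the number of matching items; I will test and write up its throughput usage later
SPECIFIC_ATTRIBUTES: returns the attributes listed in AttributesToGet

Disk I/O depends only on the amount of data read, so in most cases tuning AttributesToGet does almost nothing for read throughput (exception: LSI and projected attributes).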
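Here is how Select and AttributesToGet fit into a Query against an index, again as a plain parameter dict. The helper name and the sample table and index are hypothetical; `IndexName`, `KeyConditions`, `Select`, and `AttributesToGet` are the request parameters this section talks about.

```python
def index_query(table, index, hash_name, hash_value, attributes=None):
    """Build Query parameters for an index.

    The read stays index-only as long as every requested attribute is projected.
    """
    request = {
        "TableName": table,
        "IndexName": index,
        "KeyConditions": {
            hash_name: {
                "ComparisonOperator": "EQ",
                "AttributeValueList": [hash_value],
            }
        },
    }
    if attributes:
        request["Select"] = "SPECIFIC_ATTRIBUTES"
        request["AttributesToGet"] = list(attributes)
    else:
        # Safe default for an index: never triggers a fetch of the document.
        request["Select"] = "ALL_PROJECTED_ATTRIBUTES"
    return request

narrow = index_query("GameScores", "UserTopScore", "UserId", {"S": "u1"}, ["TopScore", "Wins"])
default = index_query("GameScores", "UserTopScore", "UserId", {"S": "u1"})
```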

Filter

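Conditions that cannot be satisfied by an index can be written as a Filter and excluded by DynamoDB. This is somewhat like a WHERE clause in MySQL that cannot use an index and ends up being evaluated by the MySQL engine.

Although a Filter does not help read throughput, using it well reduces the amount of data transferred and the number of requests a Scan / Query has to send because of the 1 MB per-response limit (the read throughput consumed per unit of time may actually go up as a result).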
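The gap between what is read and what is returned can be modeled in a few lines. This is a toy simulation on an in-memory list, not a DynamoDB call; `Count` and `ScannedCount` mirror the response fields of the same names, and the sample data is invented.

```python
def filtered_read(items, key_match, keep):
    """Model a Query: key_match picks what is read, keep filters what is returned."""
    read = [item for item in items if key_match(item)]
    returned = [item for item in read if keep(item)]
    return {
        "Items": returned,
        "Count": len(returned),     # items left after the filter
        "ScannedCount": len(read),  # items actually read, which drives capacity
    }

scores = [
    {"UserId": "u1", "GameTitle": "A", "Wins": 0},
    {"UserId": "u1", "GameTitle": "B", "Wins": 12},
    {"UserId": "u1", "GameTitle": "C", "Wins": 3},
    {"UserId": "u2", "GameTitle": "A", "Wins": 9},
]
result = filtered_read(
    scores,
    key_match=lambda item: item["UserId"] == "u1",
    keep=lambda item: item["Wins"] > 2,
)
```

Three items are read for `u1` but only two come back, so the filter saves bandwidth while the read cost stays tied to the three.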

Throughput

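This is probably one of DynamoDB's major selling points: given a sound architecture, growing from 1 to 10k qps only takes changing a number; going higher just takes writing an email and then continuing to change the number 🙂 With a poor design, however, a bill several times larger is perfectly normal.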

Write Throughput

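Relatively simple: each item written consumes 1 write unit per 1 KB. Sizes are rounded up, each index is counted separately, and writes cannot be aggregated. So writing two 100-byte items with no LSI / GSI consumes 2 write units.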
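The rounding rule can be written down as a tiny estimator. This is a back-of-the-envelope sketch based only on the rule described above; the function name and the idea of passing index entry sizes separately are my own simplification, and real billing for index updates may differ.

```python
import math

def write_units(item_sizes, index_entry_sizes=()):
    """Estimate write units: every item and index entry rounds up to 1 KB by itself."""
    sizes = list(item_sizes) + list(index_entry_sizes)
    return sum(max(1, math.ceil(size / 1024)) for size in sizes)

two_small_items = write_units([100, 100])        # two 100-byte items, no index
one_larger_item = write_units([1500])            # 1500 bytes rounds up to 2 KB
item_with_index = write_units([100], [60])       # one item plus one index entry
```

Because nothing is aggregated, many tiny writes cost as much as the same number of 1 KB writes.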

Read Throughput

A consistent read of 4 KB of data consumes 1 read unit; sequential reads (Scan, GSI, LSI with projected attributes) can be aggregated, and eventually consistent reads cost half as much. So scanning a table of 6k items averaging 40 bytes each consumes about 30 read units (6k * 40 bytes / 8 KB ≈ 30; Scan only supports eventually consistent reads).
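The same arithmetic as a small estimator: sum the bytes first, then divide by 4 KB for a consistent read or 8 KB for an eventually consistent one. This is a rough sketch of the rule as described here; the function name is made up, and it ignores per-request rounding.

```python
import math

def scan_read_units(item_count, avg_item_bytes, consistent=False):
    """Estimate read units for a sequential read where item sizes are summed first."""
    total_bytes = item_count * avg_item_bytes
    # An eventually consistent read covers twice as many bytes per unit.
    bytes_per_unit = 4096 if consistent else 4096 * 2
    return math.ceil(total_bytes / bytes_per_unit)

eventual = scan_read_units(6000, 40)                  # the example above
strong = scan_read_units(6000, 40, consistent=True)   # same data, consistent read
```

For the 6k-item example this gives 30 units with eventual consistency and 59 with a consistent read.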

Access Control with IAM

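DynamoDB's IAM Policy support is fairly complete: through Resource you can specify which tables, indexes, and even documents and attributes a user can read. The first two are the basics and are done through the ARN; the latter two belong to DynamoDB's fine-grained access control and require IAM Policy Variables combined with Condition.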

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:BatchGetItem",
                "dynamodb:Query",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:*:*:table/GameScores"
            ],
            "Condition": {
                "ForAllValues:StringEquals": {
                   "dynamodb:LeadingKeys":  ["${www.amazon.com:user_id}"],
                   "dynamodb:Attributes": [
                       "UserId","GameTitle","Wins","Losses",
                       "TopScore","TopScoreDateTime"
                    ]
                },
                "StringEqualsIfExists": {"dynamodb:Select": "SPECIFIC_ATTRIBUTES"}
            }
        }
    ]
}

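This example comes from the official documentation, where

#6-14: restricts the APIs the user can call
#15-17: the APIs apply to the GameScores table
#20: the table's hash key must equal the ${www.amazon.com:user_id} Policy Variable; when a user signs in with Login with Amazon, this variable is replaced with their Amazon ID (a number)
#21-24, 26: forces Select to be SPECIFIC_ATTRIBUTES and allows access only to specific attributes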
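To get a feel for what the Condition block of that policy allows, here is a toy evaluator. It is emphatically not how IAM is implemented; it only mimics the two operators used in the policy (ForAllValues:StringEquals and StringEqualsIfExists), and the function names and sample ids are hypothetical.

```python
def for_all_values_string_equals(request_values, allowed_values):
    """True when every value in the request appears in the policy's list."""
    allowed = set(allowed_values)
    return all(value in allowed for value in request_values)

def is_allowed(user_id, leading_keys, attributes, select=None):
    """Toy check of the policy's Condition block, not real IAM evaluation."""
    allowed_attributes = ["UserId", "GameTitle", "Wins", "Losses",
                          "TopScore", "TopScoreDateTime"]
    return (
        # The hash key of every item touched must be the caller's own id.
        for_all_values_string_equals(leading_keys, [user_id])
        # Every attribute touched must be on the allowed list.
        and for_all_values_string_equals(attributes, allowed_attributes)
        # StringEqualsIfExists: only checked when the request carries a Select.
        and (select is None or select == "SPECIFIC_ATTRIBUTES")
    )

own_row = is_allowed("42", ["42"], ["Wins", "TopScore"], "SPECIFIC_ATTRIBUTES")
other_row = is_allowed("42", ["43"], ["Wins"], "SPECIFIC_ATTRIBUTES")
hidden_attr = is_allowed("42", ["42"], ["Email"], "SPECIFIC_ATTRIBUTES")
all_attrs = is_allowed("42", ["42"], ["Wins"], "ALL_ATTRIBUTES")
```

Only the first request passes: the others touch someone else's hash key, an attribute outside the list, or ask for ALL_ATTRIBUTES.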

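After reading this post you should have a basic understanding of DynamoDB. Based on some simple tests I ran earlier, I suspect that Select: COUNT can be answered without traversing the index entries; if so, the read capacity it consumes would be nearly constant, which would make it very well suited to counting users or votes.

If nothing unexpected happens, I will finish the design overview for Headless Poller this week; whether I find time to run more tests on DynamoDB is up to fate 😉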
