Notes on using Microsoft's deberta-v3-base

When I entered a Kaggle competition I had no idea how to handle natural language processing, so this note summarizes the first things I checked.

1. Installing the libraries
2. Loading the tokenizer and model
3. Tokenizing text
4. Getting encodings from the model
5. Getting token information
[Aside] Why inputs is prefixed with ** in model(**inputs)

1. Installing the libraries

pip install transformers torch sentencepiece

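transformers: Hugging Face's Transformers library. It provides many pretrained models and the tools that go with them.
torch: PyTorch, a deep learning framework used for training models and running inference.
sentencepiece: a tokenizer library that splits text into subword units. DebertaV2Tokenizer depends on it.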
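After installing, it can help to confirm which versions were actually picked up, since hosted notebook environments usually come with some packages preinstalled. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this note.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The three packages installed above
report = {
    name: installed_version(name)
    for name in ("transformers", "torch", "sentencepiece")
}
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```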

2. Loading the tokenizer and model

from transformers import DebertaV2Tokenizer, DebertaV2Model
import torch

# Path of the DeBERTa model (here "microsoft/deberta-v3-base")
deberta_path = "microsoft/deberta-v3-base"

# Load the tokenizer and the model
tokenizer = DebertaV2Tokenizer.from_pretrained(deberta_path)
model = DebertaV2Model.from_pretrained(deberta_path).cuda()  # when using a GPU

Model paths can be looked up on the Hugging Face model hub.

DebertaV2Tokenizer.from_pretrained(deberta_path): loads the tokenizer for the model specified by deberta_path.
deberta_path: the name or path of a pretrained model. Here it is microsoft/deberta-v3-base from the Hugging Face model hub.
DebertaV2Model.from_pretrained(deberta_path): loads the weights of the model specified by deberta_path.
.cuda(): moves the model to the GPU. Omit it when not using a GPU.
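Because .cuda() fails on a machine without a GPU, one common alternative is to pick the device once and call .to(device) everywhere. The sketch below uses a small torch.nn.Linear as a stand-in so that it runs without downloading weights; the names toy_model and toy_input are placeholders, not part of the original code.

```python
import torch

# Use the GPU when one is visible to PyTorch, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in module so the snippet runs without downloading any weights
# (assumption: in the real code this would be the DebertaV2Model instance)
toy_model = torch.nn.Linear(4, 2).to(device)
toy_input = torch.zeros(1, 4).to(device)

print(device.type, toy_model(toy_input).shape)
```

The same device object can be passed to .to() for the tokenizer output, so the code no longer needs editing when moving between CPU and GPU machines.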

3. Tokenizing text

# Sample text
text = "Hello, this is a test sentence for DeBERTa."

# Tokenize the text
inputs = tokenizer(text, return_tensors="pt").to('cuda')  # when using a GPU

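tokenizer(text, return_tensors="pt"): tokenizes the text and returns tensors in the requested format.
text: the input text to tokenize.
return_tensors="pt": specifies the output format, here PyTorch tensors ("pt"). The return value is a dictionary-like object and can be inspected as follows.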

# Get the key/value pairs
items_list = list(inputs.items())
print(items_list)  # a list of (key, value) tuples

The output shows that the keys input_ids, token_type_ids, and attention_mask are stored.

[('input_ids', tensor([[    1,  5365,   261,   291,   269,   266,  1010,  4378,   270,  2060,   67943,   452,   260,     2]], device='cuda:0')), ('token_type_ids', tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')), ('attention_mask', tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0'))]
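Printing every tensor gets noisy quickly, so a shape and dtype summary is often easier to read. The sketch below rebuilds a plain dict from the IDs printed above as a stand-in for the tokenizer output; inputs_example and summary are names chosen for this illustration.

```python
import torch

# Stand-in for the tokenizer output, built from the IDs printed above
# (assumption: the real object is the dict-like value returned by tokenizer(...))
inputs_example = {
    "input_ids": torch.tensor(
        [[1, 5365, 261, 291, 269, 266, 1010, 4378, 270, 2060, 67943, 452, 260, 2]]
    ),
    "token_type_ids": torch.zeros((1, 14), dtype=torch.long),
    "attention_mask": torch.ones((1, 14), dtype=torch.long),
}

# Summarize each entry as (shape, dtype) instead of printing every value
summary = {
    key: (tuple(value.shape), value.dtype)
    for key, value in inputs_example.items()
}
for key, (shape, dtype) in summary.items():
    print(key, shape, dtype)
```

Each entry has shape (1, 14): one text, fourteen token positions.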

I also checked how the value under the key input_ids is structured.

input_ids = inputs['input_ids']
print(input_ids)
print(type(input_ids))

# Iterate over the rows, getting the index and then the element
for j, input_id in enumerate(input_ids):
    print(str(j) + ":" + str(input_id))
    print(type(j))
    print(type(input_id))

Result

tensor([[    1,  5365,   261,   291,   269,   266,  1010,  4378,   270,  2060,
         67943,   452,   260,     2]], device='cuda:0')
<class 'torch.Tensor'>
0:tensor([    1,  5365,   261,   291,   269,   266,  1010,  4378,   270,  2060,
        67943,   452,   260,     2], device='cuda:0')
<class 'int'>
<class 'torch.Tensor'>

.to('cuda'): moves the input tensors to the GPU. Omit it when not using a GPU.

Note: to tokenize several texts at once, pass them as a list and add the padding and truncation arguments to the tokenizer call, as shown below.

# List of sample texts
texts = [
    "Hello, this is a test sentence for DeBERTa.",
    "This is another sentence."
]

# Tokenize several texts at once
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to('cuda')  # when using a GPU

# Get the encodings from the model
with torch.no_grad():
    outputs = model(**inputs)

# Output
last_hidden_states = outputs.last_hidden_state

print(last_hidden_states)

# Get the key/value pairs
items_list = list(inputs.items())
print(items_list)  # a list of (key, value) tuples

input_ids = inputs['input_ids']
print(input_ids)
print(type(input_ids))

# Iterate over the rows, getting the index and then the element
for j, input_id in enumerate(input_ids):
    print(str(j) + ":" + str(input_id))
    print(type(j))
    print(type(input_id))

Result

Each tensor now has one more row, one per input text.

tensor([[[ 0.1410,  0.2415,  0.0471,  ..., -0.0403,  0.2009,  0.0085],
         [ 0.5594,  0.5406, -0.0293,  ..., -0.8153, -0.2713,  0.3581],
         [ 0.3453, -0.1580, -0.6314,  ...,  0.5872, -0.2516, -0.8309],
         ...,
         [-0.1474,  0.7867, -0.6533,  ...,  0.8116,  0.7660,  0.1426],
         [ 0.3337,  0.4113,  0.0041,  ...,  0.6645,  0.0348, -0.2286],
         [ 0.2098,  0.2286,  0.0217,  ..., -0.0229,  0.2234,  0.0148]],

        [[ 0.1414,  0.2662,  0.0501,  ..., -0.0337,  0.1750, -0.0078],
         [ 0.7238,  0.5721,  0.1514,  ..., -0.7817, -0.6813,  0.0939],
         [ 0.2918,  0.4927, -0.0177,  ..., -0.1356, -0.3676, -0.0925],
         ...,
         [-0.1517,  0.1928, -1.7749,  ...,  0.1492, -0.2373,  1.1913],
         [-0.1517,  0.1928, -1.7749,  ...,  0.1492, -0.2373,  1.1913],
         [-0.1517,  0.1928, -1.7749,  ...,  0.1492, -0.2373,  1.1913]]],
       device='cuda:0')
[('input_ids', tensor([[    1,  5365,   261,   291,   269,   266,  1010,  4378,   270,  2060,
         67943,   452,   260,     2],
        [    1,   329,   269,   501,  4378,   260,     2,     0,     0,     0,
             0,     0,     0,     0]], device='cuda:0')), ('token_type_ids', tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')), ('attention_mask', tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'))]
tensor([[    1,  5365,   261,   291,   269,   266,  1010,  4378,   270,  2060,
         67943,   452,   260,     2],
        [    1,   329,   269,   501,  4378,   260,     2,     0,     0,     0,
...
1:tensor([   1,  329,  269,  501, 4378,  260,    2,    0,    0,    0,    0,    0,
           0,    0], device='cuda:0')
<class 'int'>
<class 'torch.Tensor'>

texts: the list of texts to tokenize.
padding=True: pads the sequences so that they all have the same length.
truncation=True: truncates sequences that are too long.
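To make the effect of padding concrete, here is a pure-Python sketch that reproduces the input_ids and attention_mask rows shown in the output above. The helper pad_batch is hypothetical, the pad ID 0 is read off that output, and the truncation step is simplified: it just cuts the list, whereas the real tokenizer also handles the closing special token.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) lists of token IDs to a common length.

    Returns (input_ids, attention_mask) as nested lists.
    """
    if max_length is not None:
        # Simplified truncation: drop everything past max_length
        sequences = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [
        [1] * len(seq) + [0] * (target - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Token IDs of the two sample sentences, copied from the output above
batch = [
    [1, 5365, 261, 291, 269, 266, 1010, 4378, 270, 2060, 67943, 452, 260, 2],
    [1, 329, 269, 501, 4378, 260, 2],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The shorter sentence is filled with seven 0s, and its attention_mask marks those positions with 0 so the model can ignore them.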

4. Getting encodings from the model

# Get the encodings from the model
with torch.no_grad():
    outputs = model(**inputs)

# Output
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)
print(type(last_hidden_states))

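torch.no_grad(): disables gradient computation to save memory. Use it for inference.
model(**inputs): passes the tokenized input to the model and returns the encodings.
**inputs: unpacks the tokenized input into keyword arguments. Usually this is a dictionary of input tensors.
outputs.last_hidden_state: the hidden states of the model's last layer.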
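The effect of torch.no_grad() can be seen on a tiny tensor without loading the model. This is only an illustration; the variable names are placeholders.

```python
import torch

weights = torch.ones(3, requires_grad=True)

# Normal forward pass: the result is tracked for backpropagation
tracked = (weights * 2).sum()

# Inside no_grad: the same computation is not tracked
with torch.no_grad():
    untracked = (weights * 2).sum()

print(tracked.requires_grad, untracked.requires_grad)  # True False
```

Since nothing is recorded for backpropagation inside the block, intermediate results do not need to be kept in memory.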

Output

tensor([[[ 0.1410,  0.2415,  0.0471,  ..., -0.0403,  0.2009,  0.0085],
         [ 0.5594,  0.5406, -0.0293,  ..., -0.8153, -0.2713,  0.3581],
         [ 0.3453, -0.1580, -0.6314,  ...,  0.5872, -0.2516, -0.8310],
         ...,
         [-0.1474,  0.7867, -0.6533,  ...,  0.8116,  0.7660,  0.1426],
         [ 0.3337,  0.4113,  0.0041,  ...,  0.6645,  0.0348, -0.2286],
         [ 0.2098,  0.2286,  0.0217,  ..., -0.0229,  0.2234,  0.0148]]],
       device='cuda:0')
<class 'torch.Tensor'>
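last_hidden_state has the layout (batch size, sequence length, hidden size). For a downstream task, one vector per text is usually needed; two common ways to get one, neither of which this note itself uses, are taking the first token's vector or averaging over the real tokens. The sketch below runs on a random stand-in tensor, with a hidden size of 768 assumed for a base-sized model.

```python
import torch

# Random stand-in with the same layout as last_hidden_state:
# (batch_size, sequence_length, hidden_size); 768 is an assumed hidden size
torch.manual_seed(0)
hidden = torch.randn(2, 14, 768)
attention_mask = torch.tensor([[1] * 14, [1] * 7 + [0] * 7])

# Vector of the first token of each text
first_token = hidden[:, 0, :]

# Mean over real tokens only: zero out padded positions,
# then divide by the number of real tokens
mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # shape (2, 14, 1)
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(first_token.shape, mean_pooled.shape)
```

Using the attention mask in the average keeps padded positions from diluting the vector of the shorter text.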

5. Getting token information

# Tokenize the text and get the tokens
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Get the token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

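tokenizer.tokenize(text): tokenizes the text and returns a list of tokens.
text: the input text to tokenize.
tokenizer.convert_tokens_to_ids(tokens): converts the list of tokens into a list of token IDs. The IDs are defined by the model's vocabulary, so the same token can map to a different ID under a different model.
tokens: the list of tokens.

Output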

Tokens: ['▁Hello', ',', '▁this', '▁is', '▁a', '▁test', '▁sentence', '▁for', '▁De', 'BERT', 'a', '.']
Token IDs: [5365, 261, 291, 269, 266, 1010, 4378, 270, 2060, 67943, 452, 260]
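Comparing these IDs with the input_ids printed in section 3 shows that tokenizer(text) adds one extra ID at each end (1 and 2 in that output), presumably the start and end special tokens. The check below uses only values copied from this note; the constant names are made up for the illustration.

```python
# Values copied from the outputs in this note
token_ids = [5365, 261, 291, 269, 266, 1010, 4378, 270, 2060, 67943, 452, 260]
input_ids = [1, 5365, 261, 291, 269, 266, 1010, 4378, 270, 2060, 67943, 452, 260, 2]

# Assumption: 1 and 2 are the IDs of the start and end special tokens
START_ID, END_ID = 1, 2

with_special = [START_ID] + token_ids + [END_ID]
print(with_special == input_ids)  # True
print(len(token_ids), len(input_ids))  # 12 14
```

So tokenize followed by convert_tokens_to_ids gives the bare subword IDs, while calling the tokenizer directly gives the sequence the model expects.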

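[Aside] Why inputs is prefixed with ** in model(**inputs)

The ** prefix is Python's argument unpacking. It passes a dictionary (inputs) to a function (here, the model) as keyword arguments: each key becomes an argument name and each value becomes that argument's value. Written this way, the call does not have to name each key, so the same line works whether or not the tokenizer returns, for example, token_type_ids.
The code below passes its arguments in the same way as example_function(1, 2, 3).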

def example_function(arg1, arg2, arg3):
    print(arg1, arg2, arg3)

# Prepare a dictionary
args = {
    'arg1': 1,
    'arg2': 2,
    'arg3': 3
}

# Unpack the dictionary and pass it to the function
example_function(**args)
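A sketch closer to the model call: the function forward below is a hypothetical stand-in with optional keyword parameters, not the real model. It shows that ** unpacking is equivalent to writing the keyword arguments by hand, and that a key that does not match any parameter name raises a TypeError.

```python
def forward(input_ids=None, token_type_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model call; reports which arguments were set."""
    received = {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }
    return sorted(name for name, value in received.items() if value is not None)


inputs = {"input_ids": [1, 2], "attention_mask": [1, 1]}

# These two calls are equivalent
unpacked = forward(**inputs)
explicit = forward(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
print(unpacked == explicit)  # True

# A key that is not a parameter name raises TypeError
try:
    forward(**{"unknown_key": 0})
except TypeError as error:
    print("TypeError:", error)
```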