

news 2025/12/23 18:36:54

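1. Introduction

To get a feel for large language models, start by trying ChatGPT, Bing Chat, or Bard. Have you ever wondered what it would be like to have your own ChatGPT? Imagine the excitement of creating your own GPT model. It really is an incredible feeling! To begin the journey of building a custom GPT, let's take a closer look at how GPT works.

2. Understanding GPT

GPT stands for Generative Pre-trained Transformer. It is a Transformer model that consists of a decoder only, and it has been pre-trained on large datasets such as BookCorpus and WebText. To understand GPT more deeply, it helps to survey the main types of Transformer and then look closely at how a decoder-only Transformer operates.

3. Transformers Decoded

The Transformer is a deep neural network architecture that is widely used to generate human-like text. LLMs are built on three types of Transformer architecture, and some models develop further variations on top of them.

Encoder-only Transformers

An encoder-only Transformer is a variant of the architecture that focuses solely on understanding and encoding the input sequence. It is a neural network that uses only the encoder part of the Transformer model. Encoder-only Transformers are used for natural language processing tasks such as text classification, named entity recognition, and sentiment analysis. The most prominent example is BERT (Bidirectional Encoder Representations from Transformers).

Decoder-only Transformers

A decoder-only Transformer uses only the decoder component of the Transformer model. The decoder generates an output sequence token by token, each token conditioned on the tokens before it, so these models are used for generative tasks such as text generation, machine translation, and summarization. The most prominent example is GPT (Generative Pre-trained Transformer).

Encoder-Decoder Transformers (Cross-Attention)

The Transformer-based encoder-decoder model is a neural network architecture widely used in NLP tasks such as language translation and text summarization. This is the original Transformer. The architecture has two main components: an encoder and a decoder. The encoder processes the input sequence and produces a continuous representation of it, while the decoder generates the output sequence from the encoder's representation. Transformer-based encoder-decoder models are used in many pre-trained models, including T5, BART, Pegasus, ProphetNet, and MARGE.

4. Understanding the GPT Architecture

Transformer block

The Transformer block works in three stages: prepare, enrich, and predict.

4.1 Preparation Stage

The initial input to a Transformer model is a sequence of word tokens, which are converted into word embeddings. These embeddings are then enriched with positional encodings that convey each token's position.

Positional encoding

In a Transformer model, positional encoding defines the position of each entity in the sequence and ensures that every position has a distinct representation. The Transformer itself has no inherent notion of the order of its input sequence. The positional encoding for a specific position (p) in the sequence and a dimension (i) of the embedding space is built from a combination of sine and cosine functions, where "d" denotes the dimension of the word embeddings. These functions produce a different code for each word position, and the codes extend to sequence lengths that were not encountered during training.

The next step is to feed this new sequence into the Transformer block, where each element is treated as a dense vector.

4.2 Enrichment Stage

Enrichment comprises multi-head attention, a position-wise feed-forward neural network, residual connections, and layer normalization.

Multi-head attention

Attention evaluates the importance of, and the connections between, words. It strengthens each vector representation by merging in more context. Attention relies on three vectors: the query vector, the key vector, and the value vector. All three are derived from the layer's input (in the first layer, the word embedding vectors) through learned linear projections. The query vector corresponds to the current token, while key and value vectors are computed for every token in the sequence.

In self-attention, we first compute the scaled dot product of the query vector with each key vector to obtain the attention scores. The scores are passed through a softmax function, which produces a set of attention weights that range from 0 to 1 and sum to 1. Each value vector is then scaled by its attention weight, and the sum of the scaled value vectors is the output of the self-attention layer.

Multi-head attention runs several such attention operations (heads) in parallel and combines their results. Across the depth of the network, attention layers tend to play different roles:

Early attention: captures short-range dependencies, that is, relationships between neighboring tokens such as word order, part of speech, and basic sentence structure, without requiring a step-by-step sequential pass.
Middle attention: covers the broader context of the input sequence, which may include semantic information, meaning, relationships between phrases, and the roles of different words in a sentence.
Late attention: combines the lower layers to produce a cohesive, context-aware result, including high-level abstractions, discourse structure, sentiment, and complex long-range connections.

Position-wise Feed-Forward Neural Network

This component takes the information gathered in the attention stage for each element of the sequence and transforms it into a richer state. It applies a nonlinear transformation to each element independently, and subsequent layers continue to build on the result.

Residual connections

Residual connections let information flow directly from earlier layers to later ones. They play a crucial role in mitigating the vanishing gradient problem that deep neural networks often encounter.

Layer normalization

Rather than normalizing inputs across the batch, layer normalization normalizes across the features. This stabilizes training by keeping the input distribution consistent, which matters for tasks with varying sequence lengths.

4.3 Prediction Stage

In this stage the linear function and the softmax function play the central role. We start with a sequence of context-aware vectors from the last Transformer block. Each vector in the sequence represents an input token and has been shaped by its interactions with all the other tokens. A linear function projects each output vector into a space of dimension N_w, where N_w is the vocabulary size. We then apply the softmax function to the projected vectors to obtain a probability distribution over the vocabulary. This distribution is what we use to predict the next token in the sequence.

5. Important Variables in a Transformer

5.1 Input variables

Vocabulary size: the number of unique tokens the model can recognize.
Embedding/model size: the dimension of the word embeddings, also known as the hidden size.
Sequence/context length: the maximum number of tokens the model can process at once.

5.2 Internal variables

Attention head count: in multi-head attention, the input is divided among a specified number of attention heads.
Intermediate layer size: the size of the intermediate layer in the feed-forward network, usually larger than the embedding size.
Number of layers: the number of Transformer blocks.

5.3 Training variables

Batch size: how many examples are processed together in one forward pass during training.
Tokens trained on: the total number of tokens the model sees during training; this is reported more often than the number of epochs.

6. A Custom GPT-like Model

We will build a custom GPT-like model using PyTorch. First, we import all the required libraries.

```python
# Import the necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

We will define DecoderBlock, which is a single layer of the Transformer block. The decoder block takes the following hyperparameters:

d_model: the dimension of the input vectors.
num_heads: the number of heads in the multi-head attention mechanism.
ff_hidden_layer: the dimension of the feed-forward hidden layer.
dropout: the dropout rate.

The forward method takes two inputs:

x: the input tensor.
target_mask: a mask that prevents attention to certain positions.

```python
# Decoder Block
class DecoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, ff_hidden_layer, dropout):
        super(DecoderBlock, self).__init__()

        # Masked self-attention; expects input of shape (seq_len, batch, d_model)
        self.self_attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Position-wise feed-forward network
        self.linear1 = nn.Linear(d_model, ff_hidden_layer)
        self.linear2 = nn.Linear(ff_hidden_layer, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, target_mask):
        # Self-attention, then residual connection and layer normalization
        attn_output, _ = self.self_attention(x, x, x, attn_mask=target_mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        # Feed-forward, then residual connection and layer normalization
        ff_output = self.linear2(F.relu(self.linear1(x)))
        x = x + self.dropout2(ff_output)
        x = self.norm2(x)
        return x
```

Now let's create the PositionalEncoding class, which applies a unique positional encoding to give the model information about the relative or absolute position of the tokens in the sequence.

```python
# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0).transpose(0, 1)  # shape: (max_len, 1, d_model)
        # A buffer is saved with the module but is not a trainable parameter
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
```

We need to mask the decoder's input to prevent it from attending to future positions.

```python
def generate_square_subsequent_mask(sz):
    """Generate a mask to prevent attention to future positions."""
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    # 0.0 where attention is allowed, -inf where it is blocked
    mask = mask.float().masked_fill(mask == 0, float("-inf")).masked_fill(mask == 1, float(0.0))
    return mask

mask = generate_square_subsequent_mask(sz=5)

plt.figure(figsize=(5, 5))
sns.heatmap(mask, cmap="crest", cbar=False, square=True)
plt.title("Mask for Transformer Decoder")
plt.show()
```

Now let's describe the complete Transformer decoder, including the initial embedding layer, a single Transformer decoder block, and the final linear and softmax layers. In TransformerDecoder, the linear layer maps the output dimension to the vocabulary size. A softmax layer then converts the output into a distribution over the vocabulary. Note that the code uses nn.LogSoftmax, so the model returns log-probabilities. The whole process is wrapped in the forward method, which describes the flow of data through the decoder.

```python
# Transformer Decoder
class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, ff_hidden_layer, dropout):
        super(TransformerDecoder, self).__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        self.transformer_block = DecoderBlock(d_model, num_heads, ff_hidden_layer, dropout)
        self.linear = nn.Linear(d_model, vocab_size)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoder(x)
        tgt_mask = generate_square_subsequent_mask(x.size(0))
        x = self.transformer_block(x, tgt_mask)
        output = self.linear(x)
        output = self.softmax(output)
        return output
```

Let's build an initial decoder. We first set the hyperparameters. Next we construct an input tensor of shape (context_length, batch_size). We then run a forward pass through the model, which returns a tensor of log-probabilities. Finally, we use the argmax function to extract the predicted word indices.

```python
# Define the hyperparameters
vocab_size = 1000
d_model = 512
num_heads = 1
ff_hidden_layer = 2 * d_model
dropout = 0.1
num_layers = 10  # not used by the single-block model; used by the multi-layer model later
context_length = 50
batch_size = 1

# Initialize the model
model = TransformerDecoder(vocab_size, d_model, num_heads, ff_hidden_layer, dropout)

# Create a tensor of token indices with shape (context_length, batch_size)
input_tensor = torch.randint(0, vocab_size, (context_length, batch_size))

# Forward pass through the model
output = model(input_tensor)

print(output.shape)  # (context_length, batch_size, vocab_size)

# To get the predicted word indices, we can use the argmax function
predicted_indices = output.argmax(dim=-1)

print(predicted_indices.shape)  # (context_length, batch_size)
```

Now count the parameters.

```python
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")
```

To view the output, we convert the log-probabilities to probabilities and convert the output tensor to a NumPy array.

```python
# Convert the log probabilities to probabilities
# (first position in the sequence, first sequence in the batch)
distribution = torch.exp(output[0, 0, :])

# Convert the output tensor to numpy array
distribution = distribution.detach().numpy()

# Plot the distribution
plt.figure(figsize=(12, 6))
plt.bar(np.arange(vocab_size), distribution)
plt.xlabel("Word Index")
plt.ylabel("Probability")
plt.title("Output Distribution over Vocabulary")
plt.show()
```

Now let's make a multi-layer decoder, which takes the number of layers as a parameter.

```python
class MultiLayerTransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, ff_hidden_layer, dropout, num_layers):
        super(MultiLayerTransformerDecoder, self).__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        self.transformer_blocks = nn.ModuleList([
            DecoderBlock(d_model, num_heads, ff_hidden_layer, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(d_model, vocab_size)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoder(x)
        # The same causal mask is shared by every block
        target_mask = generate_square_subsequent_mask(x.size(0))
        for transformer_block in self.transformer_blocks:
            x = transformer_block(x, target_mask)
        output = self.linear(x)
        output = self.softmax(output)
        return output
```

Follow the same process as before.

```python
# Define the hyperparameters
# Note: these settings give a large model (on the order of 1.7 billion parameters);
# reduce d_model or num_layers if memory is limited.
vocab_size = 10000
d_model = 2048
num_heads = 2
ff_hidden_layer = 8 * d_model
dropout = 0.1
num_layers = 20
context_length = 1000
batch_size = 1

# Create our input to the model to process
input_tensor = torch.randint(0, vocab_size, (context_length, batch_size))

# Initialize the model with `num_layer` layers
model = MultiLayerTransformerDecoder(vocab_size, d_model, num_heads, ff_hidden_layer, dropout, num_layers)

# Print the number of trainable parameters
print(f"The model has {count_parameters(model):,} trainable parameters")

# Run the new input_tensor through the model
output = model(input_tensor)

# Convert the log probabilities to probabilities for the first sequence
# in the batch and the first position in the sequence
distribution = torch.exp(output[0, 0, :])

# Convert the output tensor to numpy array
distribution = distribution.detach().numpy()

# Now plot the distribution
plt.figure(figsize=(12, 6))
plt.bar(np.arange(vocab_size), distribution)
plt.xlabel("Word Index")
plt.ylabel("Probability")
plt.title("Output Distribution over Vocabulary")
plt.show()
```

Printing the model shows its structure.

```
MultiLayerTransformerDecoder(
  (embedding): Embedding(10000, 2048)
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer_blocks): ModuleList(
    (0-19): 20 x DecoderBlock(
      (self_attention): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=2048, out_features=2048, bias=True)
      )
      (norm1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (linear1): Linear(in_features=2048, out_features=16384, bias=True)
      (linear2): Linear(in_features=16384, out_features=2048, bias=True)
      (norm2): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
  (linear): Linear(in_features=2048, out_features=10000, bias=True)
  (softmax): LogSoftmax(dim=-1)
)
```

Now take any dataset you like, experiment with your own decoder-only Transformer model, and you will have your own GPT. Have fun!

Code: Akriti Upadhyay

7. Conclusion

Creating our own GPT-like model involves understanding the architecture, implementing it in code, and experimenting and fine-tuning with a dataset. The journey lets us be creative and explore the exciting world of natural language processing. Building a custom GPT is not only a technical achievement but also an invitation to have fun and explore the possibilities of text generation.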
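The sinusoidal positional encoding described in section 4.1 is mentioned in the text but not written out. A standard form, which appears to match the `div_term` computed in the `PositionalEncoding` class of section 6, is sketched below. Here $p$ is the position in the sequence and $d$ is the embedding dimension, as in the text; $i$ is used to index pairs of embedding dimensions, which is an assumption about notation, and the base 10000 is taken from `math.log(10000.0)` in the code.

```latex
\[
PE_{(p,\,2i)} = \sin\!\left(\frac{p}{10000^{\,2i/d}}\right),
\qquad
PE_{(p,\,2i+1)} = \cos\!\left(\frac{p}{10000^{\,2i/d}}\right),
\qquad i = 0, 1, \dots, \tfrac{d}{2} - 1
\]
```

Even dimensions receive the sine term and odd dimensions the cosine term, which corresponds to the `pe[:, 0::2]` and `pe[:, 1::2]` assignments in the code.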
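The self-attention steps described in section 4.2 (scaled dot product, softmax, weighted sum of value vectors) together with the future-position masking from section 6 can be sketched in plain Python. This is a minimal single-head illustration, not the implementation inside `nn.MultiheadAttention`: the function names `softmax` and `causal_self_attention` are hypothetical, and the learned query/key/value projections are deliberately left out, so the raw token vectors are used for all three roles.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: one vector (list of floats) per token.
    Returns (outputs, weights), where weights[t][j] is how strongly
    token t attends to token j. Assumption: no learned projections.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > t:
                # Future position: blocked, like -inf in the decoder mask
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        # Weighted sum of the value vectors
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Three toy token vectors of dimension 2
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

Each row of `weights` sums to 1, and every entry above the diagonal is 0, which is the effect the mask produced by `generate_square_subsequent_mask` has inside the real model.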
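To see where the number printed by `count_parameters` in section 6 comes from, the count can be reproduced by hand from the hyperparameters. The helper below, `decoder_param_count`, is a hypothetical name and not part of the original code. It assumes the layer layout of `DecoderBlock` and `MultiLayerTransformerDecoder` from section 6, the usual parameter layout of `nn.MultiheadAttention` (a packed query/key/value input projection with bias plus an output projection with bias), and that the positional encoding contributes no trainable parameters because it is registered as a buffer.

```python
def decoder_param_count(vocab_size, d_model, ff_hidden_layer, num_layers=1):
    """Estimate trainable parameters of the decoder defined in section 6.

    Assumption: weights and biases are counted for every Linear and
    LayerNorm; the number of attention heads does not change the count.
    """
    embedding = vocab_size * d_model

    # Attention: packed Q/K/V projection plus output projection
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # Position-wise feed-forward network: linear1 and linear2
    feed_forward = (d_model * ff_hidden_layer + ff_hidden_layer) + (
        ff_hidden_layer * d_model + d_model
    )

    # Two LayerNorm modules per block, each with a weight and a bias vector
    layer_norms = 2 * (2 * d_model)

    per_block = attention + feed_forward + layer_norms

    # Final projection onto the vocabulary
    output_linear = d_model * vocab_size + vocab_size

    return embedding + num_layers * per_block + output_linear


# Hyperparameters of the single-block model in section 6
print(f"{decoder_param_count(1000, 512, 2 * 512):,}")  # 3,127,784
```

Under the same assumptions, the multi-layer configuration is dominated by the feed-forward layers: each block adds roughly 84 million parameters, most of them in `linear1` and `linear2`.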


http://www.huolong8.cn/news/414832/