# Solution for tasks
## 1. Sequence Inference
Using DNABERT-2 from Hugging Face to calculate the embedding of the DNA sequence: AAGTCGTTACGGTACCGTAGCTTACGGCATTA
### 1.1 Import libraries
```python
import torch
from transformers import BertModel, AutoTokenizer
```
The model cannot be loaded with the plain `AutoModel` class, since `DNABERT-2` is a custom model; `AutoModel` only works when `trust_remote_code=True` is passed (see the sketch below), so `BertModel` is used here.
### 1.2 Load the tokenizer and model from Hugging Face
```python
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M")
model = BertModel.from_pretrained("zhihan1996/DNABERT-2-117M")
```
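Alternatively, the `Auto*` classes work if `trust_remote_code=True` is passed, which lets Transformers download and run the custom model code published with the checkpoint (this is the approach the full script in Section 4 uses):
```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True permits execution of the custom DNABERT-2
# modeling code hosted alongside the checkpoint on Hugging Face
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```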
### 1.3 Define and tokenize DNA sequence
Define the DNA sequence for inference
```python
dna_sequence = "AAGTCGTTACGGTACCGTAGCTTACGGCATTA"
```
Tokenize the input sequence
```python
inputs = tokenizer(dna_sequence, return_tensors='pt')["input_ids"]
```
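As a quick sanity check (not part of the original walkthrough), the IDs can be mapped back to their subword tokens; the exact pieces depend on DNABERT-2's BPE vocabulary:
```python
print("Token IDs:", inputs)
# convert_ids_to_tokens shows how the BPE tokenizer split the sequence,
# including the special [CLS] and [SEP] tokens added at the ends
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs[0]))
```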
### 1.4 Calculate the embedding
Run the model to get hidden states
```python
with torch.no_grad():  # Disable gradient calculations for inference
    hidden_states = model(inputs)[0]  # Shape: [1, sequence_length, 768]
```
Apply mean pooling to get a single embedding vector
```python
embedding_mean = torch.mean(hidden_states[0], dim=0)
```
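The simple mean above is fine for a single sequence. When batching several sequences with padding, the attention mask should be folded into the pooling so padded positions do not dilute the mean; a minimal sketch, assuming standard Hugging Face padding behaviour and the tokenizer/model loaded above:
```python
# Batch two sequences of different lengths; the shorter one is padded
batch = tokenizer(["AAGTCGTTACGGTACCGTAGCTTACGGCATTA", "ACGTACGT"],
                  return_tensors='pt', padding=True)
with torch.no_grad():
    states = model(batch["input_ids"], attention_mask=batch["attention_mask"])[0]
mask = batch["attention_mask"].unsqueeze(-1)  # [2, seq_len, 1]; 1 marks real tokens
# Sum hidden states over real tokens only, then divide by each true length
mean_embeddings = (states * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embeddings.shape)  # torch.Size([2, 768])
```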
Print the resulting embedding
```python
print("Embedding shape:", embedding_mean.shape)
print("Embedding vector:", embedding_mean)
```
## 3. Docker Container Usage
Containerize the inference process using Docker.
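The Dockerfile and build step are elided in this excerpt; a minimal sketch of what they could look like (base image, dependency list, and file names are assumptions, not the actual project files):
```dockerfile
# Hypothetical Dockerfile; the real one is not shown in this diff
FROM python:3.10-slim

WORKDIR /app

# DNABERT-2's custom code also requires einops
RUN pip install --no-cache-dir torch transformers einops

# Script name taken from the commit's file list
COPY inference.py .

CMD ["python", "inference.py"]
```
Build the image with `docker build -t dnabert_inference .` before running it.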
Run the container interactively:
```bash
docker run -it dnabert_inference
```
`-i`: Keeps STDIN open (interactive).
`-t`: Allocates a pseudo-TTY.
## 4. Evaluation of Success
The complete inference script, combining the steps above:
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

# Define the DNA sequence for inference
dna_sequence = "AAGTCGTTACGGTACCGTAGCTTACGGCATTA"

# Tokenize the input sequence
inputs = tokenizer(dna_sequence, return_tensors='pt')["input_ids"]

# Run the model to get hidden states
with torch.no_grad():  # Disable gradient calculations for inference
    hidden_states = model(inputs)[0]  # Shape: [1, sequence_length, 768]

# Apply mean pooling to get a single embedding vector
embedding_mean = torch.mean(hidden_states[0], dim=0)

# Print the resulting embedding
print("Embedding shape:", embedding_mean.shape)  # expected: torch.Size([768])
print("Embedding vector:", embedding_mean)
```
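The commit also updates `inference_cli.py`, whose contents are not shown above; a hedged sketch of how the script could be exposed as a command-line tool (the interface is an assumption, not the actual file):
```python
import argparse

import torch
from transformers import AutoModel, AutoTokenizer


def embed(sequence: str) -> torch.Tensor:
    """Return the mean-pooled DNABERT-2 embedding for a DNA sequence."""
    tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
    model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
    inputs = tokenizer(sequence, return_tensors='pt')["input_ids"]
    with torch.no_grad():
        hidden_states = model(inputs)[0]
    return torch.mean(hidden_states[0], dim=0)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="DNABERT-2 sequence embedding")
    # Hypothetical interface; the actual CLI's arguments are not shown in the diff
    parser.add_argument("sequence", help="DNA sequence, e.g. AAGTCGTTACGGTACCGTAGCTTACGGCATTA")
    args = parser.parse_args()
    embedding = embed(args.sequence)
    print("Embedding shape:", embedding.shape)
    print("Embedding vector:", embedding)
```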