# Solution for tasks
## 1. Sequence Inference
Using DNABERT-2 from Hugging Face to calculate the embedding of the DNA sequence: AAGTCGTTACGGTACCGTAGCTTACGGCATTA
### 1.1 Import libraries
```python
import torch
from transformers import BertModel, AutoTokenizer
```
The model cannot be loaded with the plain `AutoModel` class, since `DNABERT-2` is a custom model; `AutoModel` only works when `trust_remote_code=True` is passed (see the sketch below), so `BertModel` is used here.
### 1.2 Load the tokenizer and model from Hugging Face
```python
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M")
model = BertModel.from_pretrained("zhihan1996/DNABERT-2-117M")
```
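Alternatively, the `Auto*` classes work if `trust_remote_code=True` is passed, which lets Transformers download and run the custom model code published with the checkpoint (this is the approach the full script in Section 4 uses):
```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True permits execution of the custom DNABERT-2
# modeling code hosted alongside the checkpoint on Hugging Face
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```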
### 1.3 Define and tokenize DNA sequence
Define the DNA sequence for inference
```python
dna_sequence = "AAGTCGTTACGGTACCGTAGCTTACGGCATTA"
```
Tokenize the input sequence
```python
inputs = tokenizer(dna_sequence, return_tensors='pt')["input_ids"]
```
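As a quick sanity check (not part of the original walkthrough), the IDs can be mapped back to their subword tokens; the exact pieces depend on DNABERT-2's BPE vocabulary:
```python
print("Token IDs:", inputs)
# convert_ids_to_tokens shows how the BPE tokenizer split the sequence,
# including the special [CLS] and [SEP] tokens added at the ends
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs[0]))
```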
### 1.4 Calculate the embedding
Run the model to get hidden states
```python
with torch.no_grad():  # Disable gradient calculations for inference
    hidden_states = model(inputs)[0]  # Shape: [1, sequence_length, 768]
```
Apply mean pooling to get a single embedding vector
```python
embedding_mean = torch.mean(hidden_states[0], dim=0)
```
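The simple mean above is fine for a single sequence. When batching several sequences with padding, the attention mask should be folded into the pooling so padded positions do not dilute the mean; a minimal sketch, assuming standard Hugging Face padding behaviour and the tokenizer/model loaded above:
```python
# Batch two sequences of different lengths; the shorter one is padded
batch = tokenizer(["AAGTCGTTACGGTACCGTAGCTTACGGCATTA", "ACGTACGT"],
                  return_tensors='pt', padding=True)
with torch.no_grad():
    states = model(batch["input_ids"], attention_mask=batch["attention_mask"])[0]
mask = batch["attention_mask"].unsqueeze(-1)  # [2, seq_len, 1]; 1 marks real tokens
# Sum hidden states over real tokens only, then divide by each true length
mean_embeddings = (states * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embeddings.shape)  # torch.Size([2, 768])
```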
Print the resulting embedding
```python
print("Embedding shape:", embedding_mean.shape)
print("Embedding vector:", embedding_mean)
```
## 3. Docker Container Usage
Containerize the inference process using Docker.
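The Dockerfile and build step are elided in this excerpt; a minimal sketch of what they could look like (base image, dependency list, and file names are assumptions, not the actual project files):
```dockerfile
# Hypothetical Dockerfile; the real one is not shown in this diff
FROM python:3.10-slim

WORKDIR /app

# DNABERT-2's custom code also requires einops
RUN pip install --no-cache-dir torch transformers einops

# Script name taken from the commit's file list
COPY inference.py .

CMD ["python", "inference.py"]
```
Build the image with `docker build -t dnabert_inference .` before running it.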
Run the container interactively:
```bash
docker run -it dnabert_inference
```
`-i`: Keeps STDIN open (interactive).
`-t`: Allocates a pseudo-TTY.
## 4. Evaluation of Success
The complete inference script, combining the steps above:
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

# Define the DNA sequence for inference
dna_sequence = "AAGTCGTTACGGTACCGTAGCTTACGGCATTA"

# Tokenize the input sequence
inputs = tokenizer(dna_sequence, return_tensors='pt')["input_ids"]

# Run the model to get hidden states
with torch.no_grad():  # Disable gradient calculations for inference
    hidden_states = model(inputs)[0]  # Shape: [1, sequence_length, 768]

# Apply mean pooling to get a single embedding vector
embedding_mean = torch.mean(hidden_states[0], dim=0)

# Print the resulting embedding
print("Embedding shape:", embedding_mean.shape)  # expected: torch.Size([768])
print("Embedding vector:", embedding_mean)
```
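The commit also updates `inference_cli.py`, whose contents are not shown above; a hedged sketch of how the script could be exposed as a command-line tool (the interface is an assumption, not the actual file):
```python
import argparse

import torch
from transformers import AutoModel, AutoTokenizer


def embed(sequence: str) -> torch.Tensor:
    """Return the mean-pooled DNABERT-2 embedding for a DNA sequence."""
    tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
    model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
    inputs = tokenizer(sequence, return_tensors='pt')["input_ids"]
    with torch.no_grad():
        hidden_states = model(inputs)[0]
    return torch.mean(hidden_states[0], dim=0)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="DNABERT-2 sequence embedding")
    # Hypothetical interface; the actual CLI's arguments are not shown in the diff
    parser.add_argument("sequence", help="DNA sequence, e.g. AAGTCGTTACGGTACCGTAGCTTACGGCATTA")
    args = parser.parse_args()
    embedding = embed(args.sequence)
    print("Embedding shape:", embedding.shape)
    print("Embedding vector:", embedding)
```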