How to use Qybera/LisaV3.0 with Keras:
```python
# Available backend options are: "jax", "torch", "tensorflow".
import os
os.environ["KERAS_BACKEND"] = "jax"

import keras

model = keras.saving.load_model("hf://Qybera/LisaV3.0")
```
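Once loaded, the model can be called like any other `keras.Model`. A minimal smoke test, assuming the export accepts the same `[batch, 30, ...]` vision and audio tensors as the PyTorch version documented below (this call signature is an assumption; check `model.summary()` for the real input spec first):

```python
import numpy as np

# Dummy inputs; shapes assumed from the PyTorch usage below.
vision = np.random.randn(1, 30, 5, 224, 224).astype("float32")
audio = np.random.randn(1, 30, 1, 80, 200).astype("float32")

model.summary()  # verify the actual input signature before relying on these shapes
outputs = model([vision, audio], training=False)
```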
AdvancedLISA is a multimodal AI model that combines vision and audio processing with transformer-based reasoning. It provides scene understanding, emotion recognition, and cross-modal analysis. Its main components are:
| Component | Type | Parameters | Function |
|---|---|---|---|
| Vision Encoder | MultispectralVisionEncoder | 15,544,195 | Multispectral image processing + 3D spatial reasoning |
| Audio Encoder | AdvancedAudioEncoder | 29,479,243 | Audio analysis + emotion/speaker detection |
| Fusion Module | AdvancedFusionModule | 16,803,334 | Cross-modal attention and feature fusion |
| Reasoning Module | ReasoningModule | 68,231,168 | Transformer-based sequence reasoning |
| Voice Synthesis | IndependentVoiceSynthesis | 8,061,965 | Voice generation capabilities |
| Self Awareness | SelfAwarenessModule | 22,579,201 | Identity and context awareness |
| Conversation Memory | ConversationMemory | 6,823,937 | Persistent dialogue memory |
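These components compose into a single forward pass. The sketch below shows the assumed data flow only; the `fusion_module` attribute name and the call signatures are assumptions, so see `lisa_model.py` for the real implementation:

```python
def forward_sketch(model, vision_input, audio_input):
    """Illustrative data flow; not the actual lisa_model.py code."""
    vision_analysis = model.vision_encoder(vision_input)  # dict of vision outputs
    audio_analysis = model.audio_encoder(audio_input)     # dict of audio outputs
    # Cross-modal attention fuses the two feature streams
    fused = model.fusion_module(vision_analysis['features'],
                                audio_analysis['features'])
    reasoning = model.reasoning_module(fused)             # [batch, 30, 1024]
    return {
        'vision_analysis': vision_analysis,
        'audio_analysis': audio_analysis,
        'reasoning': reasoning,
    }
```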
The model returns a comprehensive output dictionary:
```python
{
    'vision_analysis': {
        'features': [batch, 30, 512],   # Core vision features
        'spatial_3d': [batch, 30, 6],   # 3D spatial understanding
        'scene': [batch, 30, 1000],     # Scene classification
        'objects': [batch, 30, 80],     # Object detection
        'motion': [batch, 30, 4]        # Motion analysis
    },
    'audio_analysis': {
        'features': [batch, 30, 1024],  # Core audio features
        'spatial': [batch, 30, 4],      # Spatial audio
        'emotion': [batch, 30, 7],      # Emotion classification
        'speaker': [batch, 30, 256],    # Speaker characteristics
        'content': [batch, 30, 128]     # Content analysis
    },
    'reasoning': [batch, 30, 1024],     # Fused reasoning output
    'timestamp': float,                 # Processing timestamp
    'rl_action': dict                   # Reinforcement learning actions
}
```
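Since several entries are nested dictionaries of tensors, a small recursive helper (illustrative, not part of the model API) prints every shape at once:

```python
def print_shapes(output, prefix=""):
    """Recursively print tensor shapes in the nested output dict."""
    for key, value in output.items():
        if isinstance(value, dict):
            print_shapes(value, prefix=f"{prefix}{key}.")
        elif hasattr(value, "shape"):
            print(f"{prefix}{key}: {tuple(value.shape)}")
        else:
            print(f"{prefix}{key}: {value!r}")

# e.g. print_shapes(output) after a forward pass
```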
Note: GPU inference will be significantly faster than running on CPU.
```python
import torch
import json

# Load model configuration
config_path = "Qybera/LisaV3.0/config.json"
with open(config_path, 'r') as f:
    config = json.load(f)

# Import and create model (requires lisa_model.py)
from lisa_model import create_lisa_model

# These values should mirror the loaded config.json
model_config = {
    'model_config': {
        'vision_channels': 5,  # Multispectral input
        'audio_channels': 1,
        'vision_hidden': 512,
        'audio_hidden': 512,
        'fused_dim': 1024,
        'voice_hidden': 512,
        'vision_layers': 4,
        'audio_layers': 4,
        'reasoning_layers': 8,
        'mel_bins': 80,
        'max_memory': 50
    },
    'data_config': {
        'frame_size': [224, 224],
        'seq_len': 30,
        'n_mels': 80
    }
}

# Create and load model
model, device = create_lisa_model(model_config)

# Load trained weights (on recent PyTorch, weights_only=True is the safer default)
state_dict = torch.load("Qybera/LisaV3.0/pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.eval()
```
```python
# Prepare inputs (must be exactly sequence length 30)
vision_input = torch.randn(1, 30, 5, 224, 224).to(device)  # 5-channel multispectral
audio_input = torch.randn(1, 30, 1, 80, 200).to(device)    # Mel spectrograms

# Generate comprehensive analysis
with torch.no_grad():
    output = model(vision_input, audio_input)

# Access different analysis components
vision_features = output['vision_analysis']['features']  # [1, 30, 512]
audio_emotions = output['audio_analysis']['emotion']     # [1, 30, 7]
reasoning_output = output['reasoning']                   # [1, 30, 1024]

print(f"Vision features: {vision_features.shape}")
print(f"Detected emotions: {audio_emotions.shape}")
print(f"Reasoning output: {reasoning_output.shape}")
```
```python
# Process multiple sequences
batch_size = 2
vision_batch = torch.randn(batch_size, 30, 5, 224, 224).to(device)
audio_batch = torch.randn(batch_size, 30, 1, 80, 200).to(device)

with torch.no_grad():
    batch_output = model(vision_batch, audio_batch)

print(f"Batch processing: {batch_size} sequences")
print(f"Batch reasoning output: {batch_output['reasoning'].shape}")
```
```python
# Access individual model components
vision_encoder = model.vision_encoder
audio_encoder = model.audio_encoder
reasoning_module = model.reasoning_module

# Use vision encoder separately
with torch.no_grad():
    vision_analysis = vision_encoder(vision_input)
print("Vision analysis keys:", list(vision_analysis.keys()))

# Use audio encoder separately
with torch.no_grad():
    audio_analysis = audio_encoder(audio_input)
print("Audio analysis keys:", list(audio_analysis.keys()))
```
⚠️ Important: The model expects exactly 30 frames/steps per sequence due to memory constraints.
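For real audio, this means producing 80-bin mel spectrograms and slicing them into exactly 30 windows of 200 frames. A sketch using `torchaudio` (the file path, mono downmix, and default spectrogram parameters are assumptions; match the training pipeline's actual settings):

```python
import torch
import torchaudio

SEQ_LEN, N_MELS, FRAMES = 30, 80, 200

waveform, sr = torchaudio.load("clip.wav")     # placeholder path
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono

mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=N_MELS)(waveform)
# mel: [1, 80, time]

# Pad or truncate the time axis to exactly 30 * 200 frames
total = SEQ_LEN * FRAMES
pad = max(0, total - mel.shape[-1])
mel = torch.nn.functional.pad(mel, (0, pad))[..., :total]

# [1, 80, 6000] -> [1, 30, 1, 80, 200] as the model expects
audio_input = (mel.reshape(1, N_MELS, SEQ_LEN, FRAMES)
                  .permute(0, 2, 1, 3)
                  .unsqueeze(2))
```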
Expected input shapes:

- `[batch_size, 30, 5, 224, 224]` - 5-channel multispectral images
- `[batch_size, 30, 1, 80, 200]` - Mel spectrograms with 80 frequency bins

To cite this model:

```bibtex
@misc{advancedlisa2025,
  title={AdvancedLISA: Multimodal Vision+Audio AI with Advanced Reasoning},
  author={LISA Development Team},
  year={2025},
  url={https://github.com/elijahnzeli1/LISA3D}
}
```

(The linked GitHub repository is private.)
Apache-2.0 License - see LICENSE file for details
Model card updated based on comprehensive testing - September 2025
Base model: Qybera/LisaV3