| language: | |
| - az | |
| license: apache-2.0 | |
| tags: | |
| - sentence-transformers | |
| - feature-extraction | |
| - sentence-similarity | |
| - retrieval | |
| - azerbaijani | |
| - embedding | |
| library_name: sentence-transformers | |
| pipeline_tag: sentence-similarity | |
| datasets: | |
| - LocalDoc/msmarco-az-reranked | |
| - LocalDoc/azerbaijani_retriever_corpus-reranked | |
| - LocalDoc/ldquad_v2_retrieval-reranked | |
| - LocalDoc/azerbaijani_books_retriever_corpus-reranked | |
| base_model: intfloat/multilingual-e5-small | |
| model-index: | |
| - name: LocRet-small | |
| results: | |
| - task: | |
| type: retrieval | |
| dataset: | |
| name: AZ-MIRAGE | |
| type: custom | |
| metrics: | |
| - type: mrr@10 | |
| value: 0.5250 | |
| - type: ndcg@10 | |
| value: 0.6162 | |
| - type: recall@10 | |
| value: 0.8948 | |
| # LocRet-small — Azerbaijani Retrieval Embedding Model | |
| **LocRet-small** is a compact retrieval embedding model specialized for the Azerbaijani language. Despite being **4.8× smaller** than BGE-m3 (118M vs. 568M parameters), it outperforms it on the AZ-MIRAGE Azerbaijani retrieval benchmark (MRR@10 of 0.5250 vs. 0.4204). | |
| ## Key Results | |
| ### AZ-MIRAGE Benchmark (Native Azerbaijani Retrieval) | |
| | Rank | Model | Parameters | MRR@10 | P@1 | R@5 | R@10 | NDCG@5 | NDCG@10 | | |
| |:----:|:------|:---------:|:------:|:---:|:---:|:----:|:------:|:-------:| | |
| | **#1** | **LocRet-small** | **118M** | **0.5250** | **0.3132** | **0.8267** | **0.8948** | **0.5938** | **0.6162** | | |
| | #2 | BAAI/bge-m3 | 568M | 0.4204 | 0.2310 | 0.6905 | 0.7787 | 0.4791 | 0.5079 | | |
| | #3 | perplexity-ai/pplx-embed-v1-0.6b | 600M | 0.4117 | 0.2276 | 0.6715 | 0.7605 | 0.4677 | 0.4968 | | |
| | #4 | intfloat/multilingual-e5-large | 560M | 0.4043 | 0.2264 | 0.6571 | 0.7454 | 0.4584 | 0.4875 | | |
| | #5 | intfloat/multilingual-e5-base | 278M | 0.3852 | 0.2116 | 0.6353 | 0.7216 | 0.4390 | 0.4672 | | |
| | #6 | Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.3746 | 0.2135 | 0.6006 | 0.6916 | 0.4218 | 0.4516 | | |
| | #7 | Qwen/Qwen3-Embedding-4B | 4B | 0.3602 | 0.1869 | 0.6067 | 0.7036 | 0.4119 | 0.4437 | | |
| | #8 | intfloat/multilingual-e5-small (base) | 118M | 0.3586 | 0.1958 | 0.5927 | 0.6834 | 0.4079 | 0.4375 | | |
| | #9 | Qwen/Qwen3-Embedding-0.6B | 600M | 0.2951 | 0.1516 | 0.4926 | 0.5956 | 0.3339 | 0.3676 | | |
| ## Usage | |
| ```python | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer("LocalDoc/LocRet-small") | |
| queries = ["Azərbaycanın paytaxtı hansı şəhərdir?"] | |
| passages = [ | |
| "Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.", | |
| "Gəncə Azərbaycanın ikinci böyük şəhəridir.", | |
| ] | |
| query_embeddings = model.encode_query(queries) | |
| passage_embeddings = model.encode_document(passages) | |
| similarities = model.similarity(query_embeddings, passage_embeddings) | |
| print(similarities) | |
| ``` | |
| The prefixes `"query: "` and `"passage: "` are applied automatically by `encode_query` and `encode_document`. When `model.encode` is called directly, the `"passage: "` prefix is added by default. | |
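Once the similarity matrix is available, ranking passages for a query is a sort over one row. The sketch below illustrates that step in plain Python, with toy 3-dimensional vectors standing in for the model's 384-dimensional embeddings. The helper names `cosine` and `rank_passages` are illustrative and are not part of sentence-transformers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_passages(query_vec, passage_vecs, top_k=5):
    """Return (passage_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors (assumption): real inputs would be rows of query_embeddings
# and passage_embeddings from the usage example.
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_passages(query_vec, passage_vecs, top_k=2)
print(ranking)
```

For large corpora, a vector index would normally replace the exhaustive loop; the sort shown here is only meant to make the ranking step explicit.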
| ## Training | |
| ### Method | |
| LocRet-small is fine-tuned from [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) using **listwise KL distillation** combined with a contrastive loss: | |
| $$\mathcal{L} = \mathcal{L}_{\text{KL}} + 0.1 \cdot \mathcal{L}_{\text{InfoNCE}}$$ | |
| - **Listwise KL divergence**: Distills the ranking distribution from a cross-encoder teacher ([bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) over candidate lists of 1 positive + up to 10 hard negatives per query. Teacher and student softmax distributions use asymmetric temperatures (τ_teacher = 0.3, τ_student = 0.05). | |
| - **In-batch contrastive loss (InfoNCE)**: Adds diversity by using the positive passages of other queries in the same batch as negatives. | |
| This approach preserves the full teacher ranking signal rather than reducing it to binary relevance labels, which is critical for training on top of already strong pre-trained retrievers. | |
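The combined objective can be sketched in plain Python for a single query and a small batch. This is a minimal illustration under stated assumptions, not the training code: the KL direction (teacher relative to student), the InfoNCE temperature of 0.05, and averaging the KL term over queries are all assumptions, and the function names are hypothetical. Only the 0.1 weight and the two distillation temperatures come from the description above.

```python
import math

def log_softmax(scores, temperature):
    """Temperature-scaled log-softmax, shifted by the max for stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]

def listwise_kl(teacher_scores, student_scores, tau_teacher=0.3, tau_student=0.05):
    """KL(teacher || student) over one query's candidate list (assumed direction)."""
    log_p = log_softmax(teacher_scores, tau_teacher)
    log_q = log_softmax(student_scores, tau_student)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def info_nce(sim_matrix, temperature=0.05):
    """In-batch InfoNCE: row i is query i, column i is its positive passage."""
    losses = [-log_softmax(row, temperature)[i] for i, row in enumerate(sim_matrix)]
    return sum(losses) / len(losses)

def combined_loss(kl_terms, sim_matrix, weight=0.1):
    """L = mean(L_KL) + weight * L_InfoNCE."""
    return sum(kl_terms) / len(kl_terms) + weight * info_nce(sim_matrix)

# One candidate list: index 0 is the positive, the rest are hard negatives.
teacher = [0.92, 0.40, 0.15]   # illustrative cross-encoder scores
student = [0.71, 0.55, 0.30]   # illustrative cosine similarities
kl = listwise_kl(teacher, student)
sims = [[0.8, 0.3], [0.2, 0.7]]  # query-to-positive similarities in a batch of 2
loss = combined_loss([kl], sims)
print(kl, loss)
```

The lower student temperature sharpens the student distribution, so small differences in cosine similarity translate into large differences in probability mass over the candidate list.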
| ### Data | |
| The model was trained on approximately **3.5 million** Azerbaijani query-passage pairs from four datasets: | |
| | Dataset | Pairs | Domain | Type | | |
| |:--------|------:|:-------|:-----| | |
| | [msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | ~1.4M | General web QA | Translated EN→AZ | | |
| | [azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | ~1.6M | Books, politics, history | Native AZ | | |
| | [azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | ~189K | News, culture | Native AZ | | |
| | [ldquad_v2_retrieval-reranked](https://huggingface.co/datasets/LocalDoc/ldquad_v2_retrieval-reranked) | ~330K | Wikipedia QA | Native AZ | | |
| All datasets include hard negatives with scores from a cross-encoder reranker; these scores serve as the teacher signal for listwise distillation. False negatives were filtered using normalized score thresholds. | |
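The exact filtering rule and threshold are not stated here. As one plausible illustration, the sketch below drops any negative whose normalized teacher score comes too close to the positive's score, in the style of positive-relative thresholds used in hard-negative mining. The function name, the `max_ratio=0.95` value, and the assumption that scores lie in [0, 1] are all hypothetical; the cap of 10 negatives mirrors the candidate list size described under Method.

```python
def filter_false_negatives(positive_score, negatives, max_ratio=0.95, max_negatives=10):
    """Keep hard negatives scoring below max_ratio * positive_score.

    negatives: list of (passage_id, teacher_score), scores assumed in [0, 1].
    """
    threshold = max_ratio * positive_score
    kept = [(pid, score) for pid, score in negatives if score < threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest negatives first
    return kept[:max_negatives]

# Illustrative scores: p1 and p4 score close to (or above) the positive,
# so they are treated as likely false negatives and dropped.
candidates = [("p1", 0.97), ("p2", 0.80), ("p3", 0.35), ("p4", 0.91)]
kept = filter_false_negatives(positive_score=0.95, negatives=candidates)
print(kept)
```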
| ### Hyperparameters | |
| | Parameter | Value | | |
| |:----------|:------| | |
| | Base model | intfloat/multilingual-e5-small | | |
| | Max sequence length | 512 | | |
| | Effective batch size | 256 | | |
| | Learning rate | 5e-5 | | |
| | Schedule | Linear warmup (5%) + cosine decay | | |
| | Precision | FP16 | | |
| | Epochs | 1 | | |
| | Training time | ~25 hours | | |
| | Hardware | 4× NVIDIA RTX 5090 (32GB) | | |
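The schedule in the table can be written as a small function. This is a sketch under assumptions: the decay is taken to reach zero at the final step, warmup is taken to start near zero, and the step count uses the approximate 3.5M pair total, so the real run may differ in each of these details.

```python
import math

def learning_rate(step, total_steps, peak_lr=5e-5, warmup_fraction=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Rough step count for one epoch: ~3.5M pairs at an effective batch size of 256.
total_steps = math.ceil(3_500_000 / 256)
for step in (0, 341, 683, total_steps // 2, total_steps - 1):
    print(step, learning_rate(step, total_steps))
```

With these numbers, one epoch is roughly 13.7K optimizer steps, of which about 680 are warmup.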
| ### Training Insights | |
| - **Listwise KL distillation outperforms standard contrastive training** (MultipleNegativesRankingLoss) for fine-tuning pre-trained retrievers, consistent with findings from [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and [cadet-embed](https://arxiv.org/abs/2505.19274). | |
| - **Retrieval pre-training matters more than language-specific pre-training** for retrieval tasks: multilingual-e5-small (with retrieval pre-training) significantly outperforms XLM-RoBERTa and other BERT variants (without retrieval pre-training) as a base model. | |
| - **A mix of translated and native data** prevents catastrophic forgetting while enabling language specialization. | |
| ## Benchmark | |
| ### AZ-MIRAGE | |
| [AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) is a native Azerbaijani retrieval benchmark with 7,373 queries and 40,448 document chunks covering diverse topics. It evaluates retrieval quality on naturally written Azerbaijani text. | |
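For reference, the metrics reported in the results table can be computed per query as sketched below and then averaged over all queries. The sketch assumes binary relevance and a ranked list without duplicates; the official AZ-MIRAGE evaluation script may differ in details, and the function names are illustrative.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear within the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: discounted gain normalized by the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

# Illustrative query: the single relevant chunk is retrieved at rank 2.
ranked = ["d7", "d2", "d9", "d4"]
relevant = {"d2"}
print(mrr_at_k(ranked, relevant), recall_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```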
| ## Model Details | |
| | Property | Value | | |
| |:---------|:------| | |
| | Architecture | BERT encoder (XLM-RoBERTa tokenizer) | |
| | Parameters | 118M | | |
| | Embedding dimension | 384 | | |
| | Max tokens | 512 | | |
| | Vocabulary | SentencePiece (250K) | | |
| | Similarity function | Cosine similarity | | |
| | Language | Azerbaijani (az) | | |
| | License | Apache 2.0 | | |
| ## Limitations | |
| - Optimized for Azerbaijani text retrieval. Performance on other languages may be lower than the base multilingual-e5-small model. | |
| - Maximum input length is 512 tokens. Longer documents should be chunked. | |
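One common way to chunk longer documents is a sliding window with overlap, sketched below. The function name and the overlap of 64 tokens are illustrative assumptions, and the token list stands in for the output of the model's own tokenizer. Because the 512-token limit typically also covers special tokens and the `"passage: "` prefix, a slightly smaller `max_tokens` leaves headroom.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into windows of at most max_tokens with overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    stride = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks

# Placeholder tokens (assumption): a real pipeline would tokenize the document
# with the model's tokenizer and decode each window back to text.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk is then embedded as a separate passage, and a document can be scored by its best-matching chunk.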
| ## Citation | |
| ```bibtex | |
| @misc{locret-small-2026, | |
| title={LocRet-small: A Compact Azerbaijani Retrieval Embedding Model}, | |
| author={LocalDoc}, | |
| year={2026}, | |
| url={https://huggingface.co/LocalDoc/LocRet-small} | |
| } | |
| ``` | |
| ## Acknowledgments | |
| - Base model: [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | |
| - Teacher reranker: [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | |
| - Training methodology inspired by [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and cross-encoder listwise distillation research. |