	---
	language:
	- az
	license: apache-2.0
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- retrieval
	- azerbaijani
	- embedding
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	datasets:
	- LocalDoc/msmarco-az-reranked
	- LocalDoc/azerbaijani_retriever_corpus-reranked
	- LocalDoc/ldquad_v2_retrieval-reranked
	- LocalDoc/azerbaijani_books_retriever_corpus-reranked
	base_model: intfloat/multilingual-e5-small
	model-index:
	- name: LocRet-small
	  results:
	  - task:
	      type: retrieval
	    dataset:
	      name: AZ-MIRAGE
	      type: custom
	    metrics:
	    - type: mrr@10
	      value: 0.5250
	    - type: ndcg@10
	      value: 0.6162
	    - type: recall@10
	      value: 0.8948
	---

	# LocRet-small — Azerbaijani Retrieval Embedding Model

	LocRet-small is a compact retrieval embedding model specialized for the Azerbaijani language. With 118M parameters it is about 4.8× smaller than BGE-m3 (568M), yet it scores higher on the AZ-MIRAGE benchmark (MRR@10 of 0.5250 versus 0.4204).

	## Key Results

	### AZ-MIRAGE Benchmark (Native Azerbaijani Retrieval)

	| Rank | Model | Parameters | MRR@10 | P@1 | R@5 | R@10 | NDCG@5 | NDCG@10 |
	|:----:|:------|:---------:|:------:|:---:|:---:|:----:|:------:|:-------:|
	| #1 | LocRet-small | 118M | 0.5250 | 0.3132 | 0.8267 | 0.8948 | 0.5938 | 0.6162 |
	| #2 | BAAI/bge-m3 | 568M | 0.4204 | 0.2310 | 0.6905 | 0.7787 | 0.4791 | 0.5079 |
	| #3 | perplexity-ai/pplx-embed-v1-0.6b | 600M | 0.4117 | 0.2276 | 0.6715 | 0.7605 | 0.4677 | 0.4968 |
	| #4 | intfloat/multilingual-e5-large | 560M | 0.4043 | 0.2264 | 0.6571 | 0.7454 | 0.4584 | 0.4875 |
	| #5 | intfloat/multilingual-e5-base | 278M | 0.3852 | 0.2116 | 0.6353 | 0.7216 | 0.4390 | 0.4672 |
	| #6 | Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.3746 | 0.2135 | 0.6006 | 0.6916 | 0.4218 | 0.4516 |
	| #7 | Qwen/Qwen3-Embedding-4B | 4B | 0.3602 | 0.1869 | 0.6067 | 0.7036 | 0.4119 | 0.4437 |
	| #8 | intfloat/multilingual-e5-small (base) | 118M | 0.3586 | 0.1958 | 0.5927 | 0.6834 | 0.4079 | 0.4375 |
	| #9 | Qwen/Qwen3-Embedding-0.6B | 600M | 0.2951 | 0.1516 | 0.4926 | 0.5956 | 0.3339 | 0.3676 |


	## Usage

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("LocalDoc/LocRet-small")

	queries = ["Azərbaycanın paytaxtı hansı şəhərdir?"]
	passages = [
	    "Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.",
	    "Gəncə Azərbaycanın ikinci böyük şəhəridir.",
	]

	query_embeddings = model.encode_query(queries)
	passage_embeddings = model.encode_document(passages)

	similarities = model.similarity(query_embeddings, passage_embeddings)
	print(similarities)
	```
	The `"query: "` and `"passage: "` prefixes are applied automatically by `encode_query` and `encode_document`. When calling `model.encode` directly, the `"passage: "` prefix is added by default.
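
Once the similarity scores are available, retrieval comes down to sorting the passages by score for each query. The following is a minimal sketch in plain Python; `rank_passages` is a hypothetical helper (not part of sentence-transformers), and the scores in the example call are made up for illustration rather than produced by the model.

```python
def rank_passages(scores, passages, top_k=10):
    """Return the top_k (passage, score) pairs, best match first.

    `scores` holds one similarity value per passage for a single query,
    for example one row of the similarity matrix converted to a list.
    """
    if len(scores) != len(passages):
        raise ValueError("scores and passages must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Illustrative scores only, not real model output.
example = rank_passages([0.42, 0.87], ["passage A", "passage B"], top_k=1)
print(example)  # [('passage B', 0.87)]
```

With the usage snippet, a row of scores could be obtained with `similarities[0].tolist()`, assuming `model.similarity` returns a tensor as it does in current sentence-transformers releases.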


	## Training

	### Method

	LocRet-small is fine-tuned from [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) using listwise KL distillation combined with a contrastive loss:

	$$\mathcal{L} = \mathcal{L}_{\text{KL}} + 0.1 \cdot \mathcal{L}_{\text{InfoNCE}}$$

	- Listwise KL divergence: Distills the ranking distribution from a cross-encoder teacher ([bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) over candidate lists of 1 positive + up to 10 hard negatives per query. Teacher and student softmax distributions use asymmetric temperatures (τ_teacher = 0.3, τ_student = 0.05).
	- In-batch contrastive loss (InfoNCE): Adds diversity by using the positive passages of the other queries in the batch as negatives.

	This approach preserves the full teacher ranking signal rather than reducing it to binary relevance labels, which is critical for training on top of already strong pre-trained retrievers.
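
To make the combined objective concrete, here is a minimal pure-Python sketch of the two terms. It is not the training code. Several details are assumptions that the card does not state: the KL direction (teacher relative to student), averaging over queries, and reusing the student temperature for InfoNCE. All function names are hypothetical.

```python
import math


def log_softmax(scores, temperature):
    """Numerically stable log-softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def listwise_kl(teacher_scores, student_scores, tau_teacher=0.3, tau_student=0.05):
    """KL(teacher || student) over one candidate list (assumed direction)."""
    log_p = log_softmax(teacher_scores, tau_teacher)
    log_q = log_softmax(student_scores, tau_student)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def info_nce(sim_matrix, temperature=0.05):
    """In-batch contrastive loss; row i's positive passage is column i."""
    losses = [-log_softmax(row, temperature)[i] for i, row in enumerate(sim_matrix)]
    return sum(losses) / len(losses)


def combined_loss(teacher_lists, student_lists, sim_matrix, weight=0.1):
    """L = L_KL + weight * L_InfoNCE, with L_KL averaged over queries."""
    kl_terms = [listwise_kl(t, s) for t, s in zip(teacher_lists, student_lists)]
    return sum(kl_terms) / len(kl_terms) + weight * info_nce(sim_matrix)
```

Because the teacher temperature (0.3) is larger than the student temperature (0.05), the student's cosine similarities need a much smaller spread than the teacher's scores to produce the same ranking distribution.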

	### Data

	The model was trained on approximately 3.5 million Azerbaijani query-passage pairs from four datasets:

	| Dataset | Pairs | Domain | Type |
	|:--------|------:|:-------|:-----|
	| [msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | ~1.4M | General web QA | Translated EN→AZ |
	| [azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | ~1.6M | Books, politics, history | Native AZ |
	| [azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | ~189K | News, culture | Native AZ |
	| [ldquad_v2_retrieval-reranked](https://huggingface.co/datasets/LocalDoc/ldquad_v2_retrieval-reranked) | ~330K | Wikipedia QA | Native AZ |

	All datasets include hard negatives scored by a cross-encoder reranker, which serve as the teacher signal for listwise distillation. False negatives were filtered using normalized score thresholds.
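
The card does not give the normalization or the threshold used for filtering, so the following is only one plausible sketch of the idea. It assumes teacher scores already normalized to [0, 1] and drops any candidate negative that scores close to the positive. The `max_ratio` value and the function name are hypothetical.

```python
def filter_false_negatives(positive_score, negatives, max_ratio=0.95, max_negatives=10):
    """Keep hard negatives whose teacher score is clearly below the positive's.

    `negatives` is a list of (passage, score) pairs with scores in [0, 1].
    `max_ratio` is an illustrative threshold, not a value from the card.
    """
    cutoff = positive_score * max_ratio
    kept = [(passage, score) for passage, score in negatives if score < cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest negatives first
    return kept[:max_negatives]
```

A candidate that scores nearly as high as the positive is likely an unlabeled relevant passage, so keeping it as a negative would push the student away from a correct answer.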

	### Hyperparameters

	| Parameter | Value |
	|:----------|:------|
	| Base model | intfloat/multilingual-e5-small |
	| Max sequence length | 512 |
	| Effective batch size | 256 |
	| Learning rate | 5e-5 |
	| Schedule | Linear warmup (5%) + cosine decay |
	| Precision | FP16 |
	| Epochs | 1 |
	| Training time | ~25 hours |
	| Hardware | 4× NVIDIA RTX 5090 (32 GB) |
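
The schedule in the table can be sketched as a function of the optimizer step. This is an illustration under two assumptions: the cosine phase decays to zero, and the step count is estimated from roughly 3.5 million pairs at an effective batch size of 256. The function name is hypothetical.

```python
import math


def learning_rate(step, total_steps, peak_lr=5e-5, warmup_fraction=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Rough estimate: about 13.7K optimizer steps for one epoch.
total_steps = 3_500_000 // 256
for step in (0, total_steps // 20, total_steps // 2, total_steps):
    print(step, f"{learning_rate(step, total_steps):.2e}")
```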

	### Training Insights

	- Listwise KL distillation outperforms standard contrastive training (MultipleNegativesRankingLoss) for fine-tuning pre-trained retrievers, consistent with findings from [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and [cadet-embed](https://arxiv.org/abs/2505.19274).
	- Retrieval pre-training matters more than language-specific pre-training for retrieval tasks: multilingual-e5-small (with retrieval pre-training) significantly outperforms XLM-RoBERTa and other BERT variants (without retrieval pre-training) as a base model.
	- A mix of translated and native data prevents catastrophic forgetting while enabling language specialization.

	## Benchmark

	### AZ-MIRAGE

	[AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) is a native Azerbaijani retrieval benchmark with 7,373 queries and 40,448 document chunks covering diverse topics. It evaluates retrieval quality on naturally written Azerbaijani text.
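
For readers unfamiliar with the reported metrics, the sketch below shows how MRR@k, Recall@k, and NDCG@k are commonly computed for a single query. It assumes binary relevance and unique document ids in the ranking; the benchmark's own evaluation script may differ in details, and the function names are hypothetical. Benchmark-level numbers are the mean over all queries.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["d3", "d1", "d2"]  # hypothetical retrieval output
print(mrr_at_k(ranking, {"d1"}))  # 0.5
```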

	## Model Details

	| Property | Value |
	|:---------|:------|
	| Architecture | BERT (XLM-RoBERTa) |
	| Parameters | 118M |
	| Embedding dimension | 384 |
	| Max tokens | 512 |
	| Vocabulary | SentencePiece (250K) |
	| Similarity function | Cosine similarity |
	| Language | Azerbaijani (az) |
	| License | Apache 2.0 |

	## Limitations

	- Optimized for Azerbaijani text retrieval. Performance on other languages may be lower than the base multilingual-e5-small model.
	- Maximum input length is 512 tokens. Longer documents should be chunked.
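
One simple way to chunk longer documents is a sliding window over token ids. The sketch below is an illustration, not a recommendation from the model authors: the `reserved` budget (room for special tokens and the `"passage: "` prefix) and the `overlap` size are arbitrary values, and `chunk_token_ids` is a hypothetical helper.

```python
def chunk_token_ids(token_ids, max_tokens=512, reserved=8, overlap=64):
    """Split token ids into overlapping windows that fit the model.

    `reserved` leaves room for special tokens and the prefix;
    `reserved` and `overlap` are illustrative values.
    """
    window = max_tokens - reserved
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < max_tokens - reserved")
    stride = window - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks
```

The token ids would typically come from the model's own tokenizer (for instance via `model.tokenizer`, assuming the standard sentence-transformers attribute), with each chunk decoded back to text before calling `encode_document`.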

	## Citation

	```bibtex
	@misc{locret-small-2026,
	  title={LocRet-small: A Compact Azerbaijani Retrieval Embedding Model},
	  author={LocalDoc},
	  year={2026},
	  url={https://huggingface.co/LocalDoc/LocRet-small}
	}
	```

	## Acknowledgments

	- Base model: [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
	- Teacher reranker: [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
	- Training methodology inspired by [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and cross-encoder listwise distillation research.