
Improving Text Embeddings with Large Language Models: Training

by Auto Encoder: How to Ignore the Signal Noise

March 1st, 2025

Too Long; Didn't Read

This paper introduces a novel method for generating high-quality text embeddings using synthetic data, achieving state-of-the-art results with minimal training.


Authors:

(1) Liang Wang, Microsoft Corporation (correspondence: wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation (correspondence: nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation (correspondence: fuwei@microsoft.com).

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

3.2 Training

[Image: content of Section 3.2 Training]


This paper is available on arXiv under the CC0 1.0 DEED license.

