Embedditor is an innovative solution inspired by the experiences of over 30,000 IngestAI users. Our insights revealed a common bottleneck in AI and LLM-related applications, one that goes beyond LLM hallucinations or token limits, which are far easier to resolve. The prevailing issue lies in the GIGO (garbage in, garbage out) principle. There is no one-size-fits-all approach to chunking and embedding: certain models excel with individual sentences, while others thrive on chunks of 250 to 500 tokens.
Blindly splitting content into chunks by the number of characters or tokens, and embedding it without normalization and with up to 40% redundant noise (such as punctuation, stop words, and low-relevance frequent terms), often leads to suboptimal vector search results and low-performing LLM-related applications that rely on semantic or generative search.
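As a rough illustration of the kind of normalization described above, here is a minimal Python sketch that strips punctuation and stop words from a chunk before it is embedded. It is not Embedditor's implementation: the function names `normalize_chunk` and `reduction_ratio`, the tiny stop-word list, and the sample sentence are all assumptions made for this example, and it does not handle low-relevance frequent terms.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "and", "or", "to", "in", "is", "are", "with", "on", "for"}
)


def normalize_chunk(text: str, stop_words: frozenset = STOP_WORDS) -> str:
    """Lowercase a chunk and drop punctuation and stop words before embedding."""
    # \w+ keeps runs of letters, digits, and underscores; punctuation is dropped.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)


def reduction_ratio(original: str, cleaned: str) -> float:
    """Fraction of characters removed by normalization."""
    if not original:
        return 0.0
    return 1 - len(cleaned) / len(original)


chunk = "The quality of the vector search is, in the end, limited by the quality of the input!"
cleaned = normalize_chunk(chunk)
print(cleaned)
print(f"{reduction_ratio(chunk, cleaned):.0%} of characters removed")
```

How much text is removed depends heavily on the content and on the stop-word list used, so the ratio printed here should not be read as representative. Whether lowercasing or stop-word removal actually helps also depends on the embedding model, which is why a fixed, blind pipeline is rarely the right choice.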
The difficulty lay in trying to enhance vector search using existing technologies, which proved as challenging for our users as creating an outstanding document in a basic .txt format. We decided to address the root problem, so we developed Embedditor, the Microsoft Word equivalent for embedding pre-processing, which enables anyone, even with no background in data science or technical skills, to improve the performance of their vector search while saving up to 40% on embedding and storage. We've made Embedditor open-source and accessible to all because we genuinely believe that by improving vector search performance and boosting cost-efficiency simultaneously, Embedditor may have a significant impact on the current NLP and LLM industry.