Episode 104. It’s all about Apache Tika, the project that lets you index EVERYTHING.

April 19, 2024 Freddy Guime

We continue to bring guests on the show to talk to us about interesting things, and this time it’s all about Apache Tika. This is an incredible tool for file processing and metadata extraction for search. Say you have tons of unstructured files, like emails or documents, and you want to extract their content, index it, and then search it. That is Tika’s purpose. And who better to walk us through how it does its magic than its Project Management Committee (PMC) Chair, Tim Allison!

So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).

http://www.javapubhouse.com/datadog
We thank DataDogHQ for sponsoring this podcast episode

Don’t forget to SUBSCRIBE to our cool NewsCast OffHeap!
http://www.javaoffheap.com/

Apache Tika

https://tika.apache.org/

OpenSearch Project and OpenSearch Neural Plugin Tutorials

Selected Advanced File Processing toolkits/services

Selected Hybrid Search/RAG toolkits (there are MANY others!)

Search/Relevance Conferences

Tim’s personal project

JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2

Do you like the episodes? Want more? Help us out! Buy us a beer!
https://www.javapubhouse.com/beer

And Follow us!
https://www.twitter.com/javapubhouse

Episode 103. Let’s share data cross-language with Apache Arrow! (among other things)

We had a great time talking to Matt Topol from Voltron Data about one of his Apache Software Foundation projects, Apache Arrow. It’s both a spec and an implementation of a columnar data format that is not only efficient, but also cross-language compatible. We walk through the scenarios it covers and how it is becoming more and more pivotal for things like ML and LLMs. So come listen to this JPH episode on one of the best free ways to distribute data and integrate services working on top of that data!

We thank DataDogHQ for sponsoring this podcast episode

Don’t forget to SUBSCRIBE to our cool NewsCast OffHeap!
http://www.javaoffheap.com/

Apache Arrow Project (https://arrow.apache.org/)
Java implementation (https://arrow.apache.org/docs/java/index.html)
In-Memory Analytics with Apache Arrow (https://www.oreilly.com/library/view/in-memory-analytics-with/9781801071031/)
Matt Topol X (Twitter!) Account (https://twitter.com/zeroshade)

Do you like the episodes? Want more? Help us out! Buy us a beer!
https://www.javapubhouse.com/beer

And Follow us!
https://www.twitter.com/javapubhouse
