Running Ollama: Local LLMs on Laptop

Key Points

  • Running large language models locally on your laptop eliminates cloud dependencies, ensuring full data privacy and giving developers direct control over AI resources.
  • Ollama provides a cross‑platform command‑line tool that lets you download, install, and serve quantized LLMs (e.g., from its model store) on macOS, Windows, or Linux.
  • The `ollama run` command both pulls the chosen model (like granite‑3.1‑dense) and starts a local inference server, exposing a standard API for chat and programmatic requests.
  • Local execution uses optimized back‑ends such as llama‑cpp, enabling even limited hardware to run compressed models efficiently.
  • The granite‑3.1 model showcased supports 11 languages, excels at enterprise tasks, and offers strong Retrieval‑Augmented Generation (RAG) capabilities for integrating proprietary data.
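To make the "standard API" point above concrete, here is a minimal Java sketch of how an application might address the local inference server. The port (11434) is Ollama's documented default and `/api/generate` is its completion endpoint; the model tag `granite3.1-dense` and the helper name are assumptions for illustration, so verify them against your own install.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class OllamaRequestSketch {
    // Hypothetical helper: builds a POST request against the local Ollama API.
    // Note: the prompt is interpolated without JSON escaping — illustration only;
    // a real client would use a JSON library.
    static HttpRequest buildGenerateRequest(String model, String prompt) {
        String body = String.format(
            "{\"model\":\"%s\",\"prompt\":\"%s\",\"stream\":false}", model, prompt);
        return HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:11434/api/generate"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildGenerateRequest("granite3.1-dense", "Summarize this claim.");
        // Inspect the request without sending it (no server needed here)
        System.out.println(req.method() + " " + req.uri());
        // → POST http://localhost:11434/api/generate
    }
}
```

Sending the request with `java.net.http.HttpClient` would then return the model's completion, exactly as if it were any other local service.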

Full Transcript

# Running Ollama: Local LLMs on Laptop

**Source:** [https://www.youtube.com/watch?v=uxE8FFiu_UQ](https://www.youtube.com/watch?v=uxE8FFiu_UQ)
**Duration:** 00:05:47

## Sections

- [00:00:00](https://www.youtube.com/watch?v=uxE8FFiu_UQ&t=0s) **Running Local LLMs with Ollama** - The speaker introduces Ollama, an open-source developer tool that lets you install and run large language models locally for privacy-preserving, cloud-free AI capabilities such as chat, code assistance, and RAG.
- [00:03:13](https://www.youtube.com/watch?v=uxE8FFiu_UQ&t=193s) **Integrating a Local LLM via LangChain4j** - The speaker demonstrates connecting a locally hosted Ollama model to a Java/Quarkus application using LangChain4j, enabling standardized API calls to automate insurance claim processing.

## Full Transcript
0:00 Hey, quick question.
0:01 Did you know that you can run the latest large language models locally on your laptop?
0:06 This means you don't have any dependencies on cloud services,
0:09 and you get full data privacy while using optimized models to chat,
0:14 use code assistants, and integrate AI into your applications with RAG or even agentic behavior.
0:20 So today we're taking a look at Ollama.
0:22 It's a developer tool that has been quickly growing in popularity,
0:25 and we're gonna show you how you can start using it on your machine right now.
0:29 But real quick, before we start installing things, what value does this open source project provide to you?
0:35 Well, as a developer, traditionally I'd need to request computing resources
0:39 or hardware to run something as intensive as a large language model.
0:43 And using cloud services involves sending my data to somebody else, which might not always be feasible.
0:49 So by running models from my local machine, I can maintain full control over my AI and use a model through an API,
0:57 just like I would with another service, like a database, on my own system.
1:01 Let's see this in action by switching over to my laptop and heading to ollama.com.
1:05 This is where you can install the command line tool for Mac,
1:09 Windows, and of course Linux, but also browse the repository of models:
1:13 for example, foundation models from the leading AI labs, but also
1:17 more fine-tuned or task-specific models such as code assistants.
1:21 Which one should you use?
1:23 Well, we'll take a look at that soon,
1:24 but for now, I'll open up my terminal where Ollama has been installed,
1:28 and the first step is downloading and chatting with a model locally.
1:31 So now I have Ollama set up on my local machine.
1:35 And what we're going to do first is use the ollama run command, which is almost two commands in one.
1:40 What's going to happen is it's going to pull the model from Ollama's model store,
1:44 if we don't already have it, and also start up an inference server for us
1:48 to make requests to the LLM that's running on our own machine.
1:51 So let's go ahead and do that now.
1:53 We're going to run ollama run granite3.1-dense,
1:56 and so while we have a chat interface here where we could ask questions, behind the scenes,
2:00 what we've done is downloaded a quantized,
2:03 or compressed, version of a model that's capable of running on limited hardware,
2:07 and we're also using a back end like llama.cpp to run the model.
2:11 So every time that we chat with the model, for example asking "vim or emacs?",
2:17 what's happening is we're getting our response, but we're also making a POST request to the API that's running on localhost.
2:25 Pretty cool, right?
2:26 So for our example, I ran the granite 3.1 model, and as a
2:30 developer, it has a lot of features that are quite interesting to me.
2:33 It supports 11 different languages, so it could translate between Spanish and English back and forth,
2:39 and it's also optimized for enterprise-specific tasks.
2:42 This includes high benchmarks on RAG capabilities;
2:45 RAG allows us to use our unique data with the LLM by providing it in the context window of our queries.
2:51 It also has capabilities for agentic behavior and much more,
2:55 but as always, it's good to keep your options open.
2:58 The Ollama model catalog is quite impressive, with models for embedding, vision, tools, and many more,
3:04 but you could also import your own fine-tuned models, for example,
3:08 or use them from Hugging Face by using what's known as the Ollama Modelfile.
3:13 So we've installed Ollama, we've chatted with the model running locally, and we've explored the model ecosystem,
3:18 but there's a big question left:
3:20 what about integrating an LLM like this into our existing application?
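The localhost POST described above streams the answer back as newline-delimited JSON, one token fragment per line in a `"response"` field. Here is a toy Java sketch of reassembling such a stream; the sample lines are fabricated for illustration, and a real client would read them from the HTTP response and use a proper JSON parser.

```java
public class TokenStreamSketch {
    // Each streamed line from /api/generate looks roughly like
    // {"model":"...","response":"Vim","done":false}. This toy extractor
    // pulls the "response" fragment out of one line by plain string search.
    static String extractToken(String jsonLine) {
        String key = "\"response\":\"";
        int start = jsonLine.indexOf(key);
        if (start < 0) return "";
        start += key.length();
        int end = jsonLine.indexOf('"', start);
        return jsonLine.substring(start, end);
    }

    public static void main(String[] args) {
        // Fabricated sample stream (three token fragments, then done=true)
        String[] stream = {
            "{\"model\":\"granite3.1-dense\",\"response\":\"Vim\",\"done\":false}",
            "{\"model\":\"granite3.1-dense\",\"response\":\" or\",\"done\":false}",
            "{\"model\":\"granite3.1-dense\",\"response\":\" Emacs?\",\"done\":true}"
        };
        StringBuilder answer = new StringBuilder();
        for (String line : stream) answer.append(extractToken(line));
        System.out.println(answer); // → Vim or Emacs?
    }
}
```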
3:24 So let me hop out of the chat window, and let's make sure that the model is running locally on our system.
3:30 ollama ps can show us the running models,
3:33 and now that we have a model running on localhost,
3:36 our application needs a way to communicate with this model in a standardized format.
3:40 That's where we're going to be using what's known as LangChain,
3:44 and specifically LangChain4j (LangChain for Java) in our application,
3:47 which is a framework that's grown in popularity and allows us to use
3:51 a standardized API to make calls to the model from our application that's written in Java.
3:57 Now, we're going to be using Quarkus, which is a Kubernetes-optimized Java framework
4:02 that supports the LangChain4j extension in order to call our model from the application.
4:08 Let's get started.
4:09 So let's take a look at the application that we're currently working on.
4:12 So I'll open it up here in the browser.
4:14 Now, what's happening is that this fictitious organization, Parasol,
4:18 is being overwhelmed by new insurance claims
4:22 and could use the help of an AI, like a large language model,
4:26 to help process this overwhelming amount of information and make better and quicker decisions.
4:31 But how do we do that behind the scenes?
4:33 So here in our project, we've added LangChain4j as a dependency, and we're going to specify
4:38 the URL as localhost on
4:41 port 11434 in our application.properties, pointing to where our model is running on our machine.
4:47 Now we're also gonna be using a WebSocket in order to make a POST request to the model,
4:52 and now our agents have AI capabilities, specifically a helpful assistant that can work with them to complete their job tasks.
5:00 So let's ask the model to summarize the claim details.
5:03 And there we go.
5:04 In the form of tokens, we've made that request to the model
5:08 running with Ollama on our local machine, and we're able to quickly prototype from our laptop.
5:13 It's just as simple as that.
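The application.properties wiring described around 4:33 to 4:41 would look roughly like the fragment below. The property names follow the Quarkus LangChain4j Ollama extension's convention and, along with the model id, should be treated as assumptions to check against the extension's documentation.

```properties
# Point the LangChain4j Ollama extension at the locally running server
quarkus.langchain4j.ollama.base-url=http://localhost:11434
# Model to use for chat requests (assumed tag for the granite 3.1 dense model)
quarkus.langchain4j.ollama.chat-model.model-id=granite3.1-dense
```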
5:15 So running AI locally can be really handy when it comes to prototyping, proofs of concept, and much more,
5:21 and another common use case is code assistance:
5:24 connecting a locally running model to your IDE instead of using paid services.
5:29 When it comes to production, however, you might need more advanced capabilities, but for getting started today,
5:34 Ollama is a great pick for developers.
5:36 So what are you working on or interested in?
5:39 Let us know in the comments below,
5:40 but thanks as always for watching, and don't forget to like the video if you learned something today.
5:46 Have a good one.