Skip to main content

Command Palette

Search for a command to run...

The "Local First" Approach: Testing AI with Ollama Before Going Mobile

Updated
2 min read
G
I'm a Senior Software developer who loves solving real-world problems and building meaningful products 💡 I currently focus on crafting clean, user-friendly experiences using React Native ⚛️ I enjoy working on challenging projects and constantly learning new things — whether it’s exploring a new framework or diving deeper into existing ones. This space is where I share my journey, the issues I tackle, and the lessons I pick up along the way 🚀

I’m using Ollama and Google’s latest Gemma 3 model. My goal is simple: Understand the difference between the /api/generate and /api/chat endpoints and learn how to bridge my React Native app to a local AI server.

esting in the terminal with curl is the fastest way to debug an LLM.

  • Ollama is the engine that serves the model.

  • Gemma 3 is the brain we are using.

Chat vs. Generate: Which one do you need?

  • /api/chat (The Easy Way): You send an array of messages with roles (system, user, assistant). Ollama handles the formatting for you. It’s perfect for building a ChatGPT-style interface.

  • /api/generate (The Raw Way): You send a single prompt string. It returns a context array (a bunch of numbers). If you want the AI to remember the next question, you have to send that context array back with your next request.

Streaming: Fast vs. Complete

  • stream: true (Default): The AI sends the answer piece by piece (token by token). This is how you get that "typing" effect in apps.

  • stream: false: The AI thinks in silence and sends one giant JSON object once it’s finished. Great for structured data (JSON) but feels slower to the user.

In React Native, the fetch call is straightforward, but the Android Emulator quirk is a trap. If you use localhost:11434, the app will look inside the phone and find nothing. You must use 10.0.2.2:11434 to reach your computer.

// The "Brain" of our Simple Chat App
const response = await fetch('http://10.0.2.2:11434/api/chat', {
  method: 'POST',
  body: JSON.stringify({
    model: 'gemma3',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false,
  }),
});

spent an hour wondering why my api/generate response was so long. I realized the context array at the bottom of the response is just the "compressed memory" of the conversation. It’s not human-readable, but the model needs it to keep the conversation going!

AI for Mobile Developers: Learning Local LLMs

Part 7 of 7

AI for Mobile Developers: Learning Local LLMs is a public learning journey documenting how a React Native developer explores practical AI integration for mobile apps. This series focuses on understanding how Large Language Models work and how they can run directly on mobile devices using local inference. Instead of deep AI theory, the goal is to learn from a developer perspective — experimenting with tools, running models locally, and eventually integrating AI features inside mobile applications.

Start from the beginning

Learning AI as a Mobile Developer — Why I'm Exploring Local LLMs for Mobile Apps

Journey Context After 6.5 years of building react-native apps. Most of my work has been focused on building production apps, integrating APIs, optimizing performance, and shipping features. But instea