The "Local First" Approach: Testing AI with Ollama Before Going Mobile
I’m using Ollama and Google’s latest Gemma 3 model. My goal is simple: Understand the difference between the /api/generate and /api/chat endpoints and learn how to bridge my React Native app to a local AI server.
esting in the terminal with curl is the fastest way to debug an LLM.
Ollama is the engine that serves the model.
Gemma 3 is the brain we are using.
Chat vs. Generate: Which one do you need?
/api/chat(The Easy Way): You send an array of messages with roles (system,user,assistant). Ollama handles the formatting for you. It’s perfect for building a ChatGPT-style interface./api/generate(The Raw Way): You send a single prompt string. It returns acontextarray (a bunch of numbers). If you want the AI to remember the next question, you have to send thatcontextarray back with your next request.
Streaming: Fast vs. Complete
stream: true(Default): The AI sends the answer piece by piece (token by token). This is how you get that "typing" effect in apps.stream: false: The AI thinks in silence and sends one giant JSON object once it’s finished. Great for structured data (JSON) but feels slower to the user.
In React Native, the fetch call is straightforward, but the Android Emulator quirk is a trap. If you use localhost:11434, the app will look inside the phone and find nothing. You must use 10.0.2.2:11434 to reach your computer.
// The "Brain" of our Simple Chat App
const response = await fetch('http://10.0.2.2:11434/api/chat', {
method: 'POST',
body: JSON.stringify({
model: 'gemma3',
messages: [{ role: 'user', content: 'Hello!' }],
stream: false,
}),
});
spent an hour wondering why my api/generate response was so long. I realized the context array at the bottom of the response is just the "compressed memory" of the conversation. It’s not human-readable, but the model needs it to keep the conversation going!



