Running a local model

## Models from the video:

https://huggingface.co/NidAll/supergemma4-e4b-abliterated-Q4_K_M-GGUF

https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF

When you open Hugging Face, search for models with these labels:

- `GGUF`: the format for running locally via Ollama, LM Studio, or Kobold

| Label | Meaning | When to use |
|---------|----------|-------------------|
| `UD-` | Unsloth Dynamic: quantization with an importance matrix, best quality | Pick it when available |
| `Q4_K_M` | 4 bits, K-quants medium | **Sweet spot for agents** |
| `Q8_0` | 8 bits, simple per-block scaling (not a K-quant) | Maximum quality, but roughly 2× heavier than Q4 |
| `Q4_K_S` | 4 bits, K-quants small | Weak hardware, loads faster |
| `FP16` | 16 bits, original weights | Only when the whole model fits in VRAM (16+ GB GPU) |

For agentic work I recommend **Q4_K_M**: it is the sweet spot between quality and size.
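To get a feel for what these labels mean for disk and memory, here is a rough size estimate. It is only a sketch: the bits-per-weight figures are approximate assumptions (real GGUF files mix quantization types across layers), and the 8B parameter count is just an illustrative number, not one of the models above.

```python
# Approximate bits per weight for common GGUF quantization types.
# Ballpark assumptions, not exact values for any particular model.
BITS_PER_WEIGHT = {
    "Q4_K_S": 4.6,
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the weights in gigabytes (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of params * bits per param / 8 bits per byte = gigabytes
    return params_billions * bits / 8


for quant in BITS_PER_WEIGHT:
    size = estimate_size_gb(8, quant)  # hypothetical 8B-parameter model
    print(f"{quant}: ~{size:.1f} GB")
```

The context window (KV cache) needs memory on top of the weights, so leave some headroom beyond these numbers.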

---

## Part 2. Installing Ollama and running the model

```bash
# macOS or Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com/download
```

Verify the install:

```bash
ollama --version
```

## Part 3. Modelfile: the secret to agentic work

### Increasing the context window

```bash
# Export the current Modelfile
ollama show --modelfile {modelname} > /Users/daniel/myAI/tmp/{newmodelname}.Modelfile
```

Open the file:

```bash
nano /Users/daniel/myAI/tmp/{newmodelname}.Modelfile
```

Add a `PARAMETER num_ctx` line so the file looks like this:

```text
FROM /Users/daniel/.ollama/models/blobs/sha256-fc2ebf4c44528daa2cea7b39891712847ca5e4f87dcf578054a06c46bfe6da27
RENDERER gemma4
PARSER gemma4
PARAMETER num_ctx 16384
```

If memory allows, use a larger window instead: `PARAMETER num_ctx 32768`.

```bash
# Create the agent version
ollama create {newmodelname} -f /Users/daniel/myAI/tmp/{newmodelname}.Modelfile
```
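If you rebuild models often, the manual edit can be scripted. The sketch below is one possible approach, not part of Ollama: the helper `set_num_ctx` and the path `/path/to/blob` are hypothetical names used only for illustration. It removes any existing `PARAMETER num_ctx` line and appends the new value, so the setting never ends up duplicated.

```python
import re


def set_num_ctx(modelfile_text: str, num_ctx: int) -> str:
    """Return Modelfile text with exactly one `PARAMETER num_ctx` line."""
    # Drop existing num_ctx lines so the new value is not duplicated.
    kept = [
        line
        for line in modelfile_text.splitlines()
        if not re.match(r"\s*PARAMETER\s+num_ctx\b", line, flags=re.IGNORECASE)
    ]
    kept.append(f"PARAMETER num_ctx {num_ctx}")
    return "\n".join(kept) + "\n"


# Hypothetical Modelfile content; the FROM path is a placeholder.
example = (
    "FROM /path/to/blob\n"
    "RENDERER gemma4\n"
    "PARSER gemma4\n"
    "PARAMETER num_ctx 16384\n"
)
print(set_num_ctx(example, 32768))
```

Write the result back to the `.Modelfile` and then run `ollama create` as shown above.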

### Checking tool use

Critical: the model must be able to call tools.

```bash
ollama show {modelname}
```

Look for `tools` in the `Capabilities` section of the output. If it is missing, this model will NOT work as an agent. It will only chat.
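The same check can be automated before an agent session starts. This is a sketch under an assumption: it expects `ollama show` to print a `Capabilities` heading followed by one capability per line and then a blank line, as in the `SAMPLE` text below. The layout may differ between Ollama versions, and the helper names are hypothetical.

```python
import subprocess


def parse_capabilities(show_output: str) -> list[str]:
    """Collect the lines listed under the `Capabilities` heading."""
    caps = []
    in_section = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line == "Capabilities":
            in_section = True
            continue
        if in_section:
            if not line:  # a blank line ends the section
                break
            caps.append(line)
    return caps


def supports_tools(show_output: str) -> bool:
    return "tools" in parse_capabilities(show_output)


def show_model(name: str) -> str:
    """Run `ollama show <name>` and return its text output (requires Ollama)."""
    result = subprocess.run(
        ["ollama", "show", name], capture_output=True, text=True, check=True
    )
    return result.stdout


# Assumed shape of the output, for illustration only.
SAMPLE = """\
  Model
    architecture    gemma3
    parameters      4.3B

  Capabilities
    completion
    tools

  Parameters
    temperature    1
"""
print(supports_tools(SAMPLE))
```

With Ollama installed, `supports_tools(show_model("{modelname}"))` would run the check against a real model.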

## Part 4. Claude Code and Goose: agent harnesses for local models

## Claude Code with a local model

Besides Goose, you can also run a local model in Claude Code. The command is `ollama launch claude`, followed by the model; everything after the double dash `--` is passed through to Claude Code itself.

**Basic launch:**

```bash
ollama launch claude --model rdhorner-gemma4-obliterated-textonly
```

**Bare mode**: a minimal mode without project context, MCP, skills, or background processes. Best for diagnostics and weak hardware:

```bash
ollama launch claude --model rdhorner-gemma4-obliterated-textonly -- --bare
```

**Restrict tools**: for example, only the terminal plus reading and editing files:

```bash
ollama launch claude --model rdhorner-gemma4-obliterated-textonly -- --bare --tools Bash --tools Read --tools Edit
```

**Quick one-off prompt** without an interactive session:

```bash
ollama launch claude --model rdhorner-gemma4-obliterated-textonly -- --bare -p "answer in one word: hello"
```

Important: the model must support `tools`. Check with `ollama show <model>`: `tools` must be listed under Capabilities. Otherwise Claude Code fails with the error "does not support tools".

## Goose

### Installation

```bash
# macOS (the formula is block-goose-cli; plain `goose` is a different tool)
brew install block-goose-cli

# or via the install script
curl -fsSL https://block.github.io/goose/install.sh | bash
```

### Configuration

```bash
goose configure
```

Goose reads its settings from environment variables or a config file.

**Quick start without a profile:**

```bash
goose session --no-profile
```

**Automatic mode:**

```bash
GOOSE_MODE=auto goose session --no-profile --with-builtin developer --max-turns 4
```

---

# Tests for local models

### Coding

```
Create a simple interactive console task tracker in the current folder.

The structure must be exactly:

task_tracker.py
test_task_tracker.py

Use only the Python standard library.

The program must store tasks in a tasks.json file in the current folder.

The tasks.json format must be a simple list:

[
  {
    "id": 1,
    "title": "Buy milk",
    "done": false
  }
]

In task_tracker.py implement:

1. dataclass Task:
   - id: int
   - title: str
   - done: bool

2. Business-logic functions:
   - load_tasks(path: str = "tasks.json") -> list[Task]
   - save_tasks(tasks: list[Task], path: str = "tasks.json") -> None
   - add_task(title: str, path: str = "tasks.json") -> Task
   - list_tasks(path: str = "tasks.json") -> list[Task]
   - mark_done(task_id: int, path: str = "tasks.json") -> bool
   - delete_task(task_id: int, path: str = "tasks.json") -> bool

   Rules:
   - If tasks.json does not exist, load_tasks returns an empty list.
   - add_task assigns id as the max of existing ids + 1, or 1 if there are no tasks.
   - add_task saves the task to tasks.json and returns the created Task.
   - list_tasks returns the list of tasks.
   - mark_done sets done=True on the task with the given id, saves the file, and returns True.
   - If the id is not found, mark_done returns False.
   - delete_task removes the task with the given id, saves the file, and returns True.
   - If the id is not found, delete_task returns False.

3. Interactive menu.

   When this command is run:

   python3 task_tracker.py

   the program must show the menu:

   Task Tracker
   1. Add task
   2. List tasks
   3. Mark task as done
   4. Delete task
   5. Exit
   Choose an option:

   Menu behavior:
   - If the user chooses 1:
     - ask: Task title:
     - add the task
     - print: Added task 1: Buy milk
     - show the menu again
   - If the user chooses 2:
     - if there are no tasks, print: No tasks
     - if there are tasks, print:
       1. [ ] Buy milk
       2. [x] Read book
     - show the menu again
   - If the user chooses 3:
     - ask: Task id:
     - if the task is found, print: Marked task 1 as done
     - if the task is not found, print: Task 1 not found
     - if the input is not a number, print: Invalid task id
     - show the menu again
   - If the user chooses 4:
     - ask: Task id:
     - if the task is found, print: Deleted task 1
     - if the task is not found, print: Task 1 not found
     - if the input is not a number, print: Invalid task id
     - show the menu again
   - If the user chooses 5:
     - print: Goodbye!
     - exit the program
   - If the user enters an unknown option:
     - print: Invalid option
     - show the menu again

4. You must define a main() function.

   The file must end with:

   if __name__ == "__main__":
       main()

In test_task_tracker.py write unittest tests for the business logic only, not for the interactive input:

1. test_load_missing_file
2. test_add_task
3. test_list_tasks
4. test_mark_done_existing_task
5. test_mark_done_missing_task
6. test_delete_existing_task
7. test_delete_missing_task

Important:
- Use tempfile.TemporaryDirectory in the tests.
- In the tests, pass a separate path to the json file into the functions via the path parameter.
- The tests must not use the real tasks.json from the current folder.
- Do not test input() or the interactive menu through unittest.

After creating the files, run:

python3 -m unittest test_task_tracker.py

If the tests fail:
1. Read the error.
2. Fix the code.
3. Run the tests again.

After the tests pass, check the interactive program by hand:

Run:

python3 task_tracker.py

Check this scenario:
1. Choose 1 and add the task Buy milk
2. Choose 2 and make sure the task is shown
3. Choose 3 and mark task 1 as done
4. Choose 2 and make sure the task is shown as [x]
5. Choose 4 and delete task 1
6. Choose 2 and make sure No tasks is printed
7. Choose 5 and make sure the program exits

In the final answer write only:
- the list of created files
- the result of the command python3 -m unittest test_task_tracker.py
- a short result of the manual check of the interactive menu
```
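For judging what a model produces, it helps to have a baseline to compare against. The sketch below is one possible implementation of the business-logic part of the prompt (the dataclass and the six functions), written to follow the rules listed there. It deliberately leaves out the interactive menu and the tests, and it is a reference point rather than the only correct answer.

```python
import json
import os
from dataclasses import asdict, dataclass


@dataclass
class Task:
    id: int
    title: str
    done: bool


def load_tasks(path: str = "tasks.json") -> list[Task]:
    # A missing file means there are no tasks yet.
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [Task(**item) for item in json.load(f)]


def save_tasks(tasks: list[Task], path: str = "tasks.json") -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(t) for t in tasks], f, indent=2)


def add_task(title: str, path: str = "tasks.json") -> Task:
    tasks = load_tasks(path)
    # max of existing ids + 1, or 1 when the list is empty
    next_id = max((t.id for t in tasks), default=0) + 1
    task = Task(id=next_id, title=title, done=False)
    tasks.append(task)
    save_tasks(tasks, path)
    return task


def list_tasks(path: str = "tasks.json") -> list[Task]:
    return load_tasks(path)


def mark_done(task_id: int, path: str = "tasks.json") -> bool:
    tasks = load_tasks(path)
    for task in tasks:
        if task.id == task_id:
            task.done = True
            save_tasks(tasks, path)
            return True
    return False


def delete_task(task_id: int, path: str = "tasks.json") -> bool:
    tasks = load_tasks(path)
    remaining = [t for t in tasks if t.id != task_id]
    if len(remaining) == len(tasks):
        return False  # nothing was removed, so the id was not found
    save_tasks(remaining, path)
    return True
```

Common failure points to look for in model output: forgetting the `path` parameter, reusing ids after a delete in a way that collides, and writing to the real `tasks.json` from the tests.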

### Skill: write a post for my Telegram

https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

```
Write a post for my Telegram based on this news:

Just a few weeks ago, we introduced Gemma 4, our most capable open models to date. With over 60 million downloads in just the first few weeks, Gemma 4 is delivering unprecedented intelligence-per-parameter to developer workstations, mobile devices and the cloud. Today, we are pushing efficiency even further. We’re releasing Multi-Token Prediction (MTP) drafters for the Gemma 4 family. By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic.

Gemma 4 (MTP) drafter speed ups. Tokens-per-second speed increases, tested on hardware using LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.

Why speculative decoding?

The technical reality is that standard LLM inference is memory-bandwidth bound, creating a significant latency bottleneck. The processor spends the majority of its time moving billions of parameters from VRAM to the compute units just to generate a single token. This leads to under-utilized compute and high latency, especially on consumer-grade hardware. Speculative decoding decouples token generation from verification. By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to “predict” several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel.

How speculative decoding works

Standard large language models generate text autoregressively, producing exactly one token at a time. While effective, this process dedicates the same amount of computation to predicting an obvious continuation (like predicting “words” after “Actions speak louder than…”) as it does to solving a complex logic puzzle.

MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in Fast Inference from Transformers via Speculative Decoding. If the target model agrees with the draft, it accepts the entire sequence in a single forward pass, and even generates an additional token of its own in the process. This means your application can output the full drafted sequence plus one token in the time it usually takes to generate a single one.

Unlocking faster AI from the edge to the workstation

For developers, inference speed is often the primary bottleneck for production deployment. Whether you are building coding assistants, autonomous agents that require rapid multi-step planning, or responsive mobile applications running entirely on-device, every millisecond matters. By pairing a Gemma 4 model with its corresponding drafter, developers can achieve:

- Improved responsiveness: Drastically reduce latency for near real-time chat, immersive voice applications and agentic workflows.
- Supercharged local development: Run our 26B MoE and 31B Dense models on personal computers and consumer GPUs with unprecedented speed, powering seamless, complex offline coding and agentic workflows.
- Enhanced on-device performance: Maximize the utility of our E2B and E4B models on edge devices by generating outputs faster, which in turn preserves valuable battery life.
- Zero quality degradation: Because the primary Gemma 4 model retains the final verification, you get identical frontier-class reasoning and accuracy, just delivered significantly faster.

Gemma 4 26B on a NVIDIA RTX PRO 6000. Standard Inference (left) vs. MTP Drafter (right) in tokens per second. Same output quality, half the wait time.

Where you can dive deeper into MTP drafters

To make these MTP drafters exceptionally fast and accurate, we introduced several architectural enhancements under the hood. The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out. For our E2B and E4B edge models, where the final logit calculation becomes a big bottleneck, we even implemented an efficient clustering technique in the embedder to further accelerate generation. We've also been closely analyzing hardware-specific optimizations. For example, while the 26B mixture-of-experts model presents unique routing challenges at a batch size of 1 on Apple Silicon, processing multiple requests simultaneously (e.g., batch sizes of 4 to 8) unlocks up to a ~2.2x speedup locally. We see similar gains with Nvidia A100 when increasing batch size. Want to see the exact mechanics of how this works? We’ve published an in-depth technical explainer that unpacks the visual architecture, KV cache sharing and efficient embedders powering these drafters.

How to get started

The MTP drafters for the Gemma 4 family are available today under the same open-source Apache 2.0 license as Gemma 4. Read the documentation to learn how to use MTP with Gemma 4. You can download the model weights right now on Hugging Face, Kaggle, and start experimenting with faster inference with transformers, MLX, VLLM, SGLang, and Ollama or try them directly on Google AI Edge Gallery for Android or iOS. We can't wait to see how this newfound speed accelerates what you build next in the Gemmaverse.
```

### Brave MCP

```
Find the 3 latest videos from my YouTube channel "Продуктивный Совет" and create a JSON file in the root of the directory with the link, title, description, and publication date
```
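To score this test quickly, the resulting file can be checked with a few lines of Python. This is only a sketch: the prompt does not fix the key names, so `url`, `title`, `description`, and `published_at` below are hypothetical and should be adjusted to whatever the model actually wrote.

```python
import json

# Hypothetical key names; the prompt leaves the exact names up to the model.
REQUIRED_KEYS = {"url", "title", "description", "published_at"}


def check_videos(json_text: str, expected_count: int = 3) -> list[str]:
    """Return a list of problems; an empty list means the file looks fine."""
    try:
        data = json.loads(json_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value is not a list"]

    problems = []
    if len(data) != expected_count:
        problems.append(f"expected {expected_count} entries, got {len(data)}")
    for i, item in enumerate(data):
        # A non-dict entry counts as missing every required key.
        present = set(item) if isinstance(item, dict) else set()
        missing = REQUIRED_KEYS - present
        if missing:
            problems.append(f"entry {i} is missing: {sorted(missing)}")
    return problems
```

Read the file the agent created and pass its text to `check_videos`; anything in the returned list is worth a manual look.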