Чем занимается runtime golang

Concurrency in Go in the form of goroutines is a very convenient means for writing modern concurrent software, but how does your Go program run these goroutines efficiently?
In this post, we will peek under the hood to help you understand how the Go runtime scheduler implements this magic by looking into it from the design perspective and how to use it to interpret scheduler trace information from a Go program during performance debugging.

All of the engineering marvels have come out of need, So to understand why there is a need to have a go runtime scheduler and how it works lets time-travel back to the history of the operating system which will give us insight into problems, as without understanding the root of a problem, there is no hope of solving it. This is what history does.

History of Operating Systems

Single user (no OS).
Batch, uni programmed, run to completion.
Multi programmed

The purpose of multi-programmed was to overlap CPU and I/O.

Multiple batches and Time Sharing.

Multiple batches —

IBM OS/MFT (Multiprogramming with a Fixed number of Tasks)
IBM OS/MVT (Multiprogramming with a Variable number of Tasks) — Here each job gets just the amount of memory it needs. That is, the partitioning of memory changes as jobs enter and leave.

Time-sharing —

This is multiprogramming with rapid switching between jobs. Deciding when to switch and which jobs to switch to was called scheduling.

Currently, most of the operating system uses time-sharing scheduler.

But then what entity does these scheduler schedule?

Different program in execution (process) or
The basic unit of CPU utilization (threads) that exist as subsets of a process.

But these Entity switches comes at a cost.

Work Stealing

Другим аспектом планировщика является то, что это планировщик «кражи горутин». Это помогает в нескольких областях поддерживать эффективное планирование. Во-первых, последнее, что вам нужно, это M перейти в состояние ожидания, потому что, как только это произойдет, ОС переключит M из ядра с помощью контекста. Это означает, что P не может выполнить какую-либо работу, даже если есть Goroutine в работоспособном состоянии, пока M не переключится обратно на ядро. «Кража горутин» также помогает сбалансировать временные интервалы между всеми P, чтобы работа распределялась лучше и выполнялась более эффективно.

На рисунке у нас есть многопоточная программа Go с двумя P, обслуживающими четыре G каждый и один G в GRQ. Что произойдет, если один из P быстро обслуживает все свои G?

Далее у P1 больше нет горутин для выполнения. Но есть горутины в работоспособном состоянии, как в LRQ для P2, так и в GRQ. Это момент, когда P1 нужно украсть горутину.

Правила кражи горутины следующие. Весь код можно посмотреть в исходниках runtime.

Таким образом, основываясь на этих правилах, P1 должен проверить P2 на наличие горутин в своем LRQ и взять половину того, что он найдет.

Что произойдет, если P2 завершит обслуживание всех своих программ и у P1 ничего не останется в LRQ?

P2 завершил всю свою работу и теперь должен украсть горутины. Во-первых, он будет смотреть на LRQ P1, но не найдет никаких Goroutines. Далее он будет смотреть на GRQ. Там он найдет G9.

P2 крадет G9 из GRQ и начинает выполнять работу. Что хорошо во всей этой краже, так это то, что она позволяет M оставаться занятой и не бездействовать.

But now, we have multiple run queues.

Local Run queue
Global Run queue
Network Poller

From where we should run our next goroutine?

In Go, the poll order has been defined as follows.

Local Run queue
Global Run queue
Network Poller
Work Stealing

i.e first check local run queue, if empty check global run queue, then check Network Poller and at last do work Stealing. We have some overviews of 1,2,3, by now. Let’s look into the Work Stealing.

If the local work queue is empty, try “stealing work from a different queue”

Work stealing solves the problem when one thread has too much work to do and the other is just idle. In Go, work-stealing try to satisfy one the following condition if the local queue is empty.

pull work out from the global queue.
pull work from network poller
steal work from the other local queues

Simple Goroutine Tour.

Can you guess the output of the above code snippet?.

If we look at the one combination of output, Immediately we have two questions.

How did 11 goroutines run concurrently? Magic?
In What Order 11 goroutines ran?

And these two questions give us a problem

Goroutine

A goroutine is a lightweight thread (logically a thread of execution) managed by the Go runtime. To start a new goroutine running add go keyword before the function call go add(a, b)

Diving into the Go grammar

Now, lets take a closer look at the second step. The go.y file that contains the language grammar is a good starting point for investigating the Go compiler and the key to understanding the language syntax. The main part of this file consists of declarations, similar to the following:

In this declaration, the xfndcl and fundcl nodes are defined. The fundcl node can be in one of the two forms. The first form corresponds to the following language construct:

and the second one to this language construct:

The xfndcl node consists of the func keyword that is stored in the LFUNC constant, followed by the fndcl and fnbody nodes.

An important feature of Bison (or Yacc) grammar is that it allows for placing arbitrary C code next to each node definition. The code is executed every time a match for this node definition is found in the source code. Here, you can refer to the result node as $$ and to the child nodes as $1 , $2 , etc.

It is easier to understand this through an example. Note that the following code is a shortcut version of the actual code.

First, a new node is created, which contains type information for the function declaration. The $3 argument list and the $5 result list are referenced from this node. Then, the $$ result node is created. It stores the function name and the type node. As you can see, there can be no direct correspondence between definitions in the go.y file and the node structure.

Кооперативный планировщик

Как мы уже говорили в первом посте, планировщик ОС является вытесняющим планировщиком. По сути, это означает, что вы не можете предсказать, что планировщик собирается делать в любой момент времени. Ядро принимает решения и все недетерминировано. Приложения, работающие поверх операционной системы, не контролируют то, что происходит внутри ядра с планированием, если они не используют примитивы синхронизации, такие как атомарные инструкции и вызовы мьютекса.

Планировщик Go является частью среды исполнения Go, а среда исполнения Go встроена в ваше приложение. Это означает, что планировщик Go работает в пользовательском пространстве над ядром. Текущая реализация планировщика Go является не вытесняющим, а взаимодействующим планировщиком. Быть кооперативным планировщиком означает, что планировщику нужны четко определенные события в пространстве пользователя, которые происходят в безопасных точках кода для принятия решений по планированию.

Что хорошо в кооперативном планировщике Go, так это то, что он выглядит и чувствует себя упреждающим. Вы не можете предсказать, что собирается делать планировщик Go. Это связано с тем, что принятие решений для этого планировщика зависит не от разработчиков, а от времени выполнения Go. Важно думать о планировщике Go как об упреждающем планировщике, и, поскольку планировщик недетерминирован, это не слишком сложно.

Синхронные системные вызовы

Что происходит, когда горутина хочет сделать системный вызов, который не может быть выполнен асинхронно? В этом случае Network poller не может быть использован, и горутина, выполняющая системный вызов, заблокирует M. Это плохо, но не существует способа предотвратить это. Одним из примеров системного вызова, который не может быть выполнен асинхронно, являются системные вызовы на основе файлов. Если вы используете CGO, могут быть другие ситуации, когда вызов C-функций также блокирует M.

Операционная система Windows может выполнять асинхронные системные вызовы на основе файлов. Технически при работе в Windows можно использовать Network poller.

Давайте посмотрим, что происходит с синхронным системным вызовом (например, файловым вводом / выводом), который приведет к блокировке М. На рисунке показана наша базовая диаграмма планирования, но на этот раз G1 собирается сделать синхронный системный вызов, который заблокирует M1.

Далее на рисунке планировщик может определить, что G1 вызвал блокировку М. В этот момент планировщик отсоединяет M1 от P с блокирующим G1, все еще прикрепленным. Затем планировщик вводит новый M2 для обслуживания P. В этот момент G2 может быть выбрана из LRQ и включена в контекст M2. Если M уже существует из-за предыдущего обмена, этот переход происходит быстрее, чем необходимость создания нового M.

Следующим шагом завершается системный вызов блокировки, выполненный G1. В этот момент G1 может вернуться в LRQ и снова обслуживаться P. M1 затем уходит в сторону для будущего использования, если этот сценарий должен повториться.

Состояния Горутин

Точно так же, как потоки, у горутин есть те же три состояния высокого уровня. Они определяют роль, которую планировщик Go играет с любой горутиной. Горутина может находиться в одном из трех состояний: Ожидание, Готовность или Выполнение.

Ожидание: это означает, что горутина остановлена и ждет чего-то, чтобы продолжить. Это может происходить по таким причинам, как ожидание операционной системы (системные вызовы) или синхронизация вызовов (атомарные и мьютексные операции). Эти типы задержек являются основной причиной плохой производительности.

Готовность: это означает, что горутина хочет получить время, чтобы выполнить назначенные инструкции. Если у вас много горутин, которым нужно время, то горутине придется ждать дольше, чтобы получить время. Кроме того, индивидуальное количество времени, которое получает любая горутина, сокращено, поскольку больше горутин конкурируют за время. Этот тип задержки планирования также может быть причиной плохой производительности.

Выполнение: это означает, что горутина была помещена в M и выполняет свои инструкции. Работа, связанная с приложением, завершена. Это то, что все хотят.

Ваша программа запускается

Когда ваша программа Go запускается, ей присваивается логический процессор (P) для каждого виртуального ядра, определенного на хост-машине. Если у вас есть процессор с несколькими аппаратными потоками на физическое ядро (Hyper-Threading), каждый аппаратный поток будет представлен вашей программе, как виртуальное ядро. Чтобы лучше понять это, взгляните на системный отчет для моего MacBook Pro.

Вы можете видеть, что у меня один процессор с 4 физическими ядрами. В этом отчете не раскрывается количество аппаратных потоков на каждое физическое ядро. Процессор Intel Core i7 имеет технологию Hyper-Threading, что означает, что на физическое ядро приходится 2 аппаратных потока. Это сообщит программе Go, что доступно 8 виртуальных ядер для параллельного выполнения потоков ОС. Чтобы проверить это, рассмотрим следующую программу:

Каждой программе Go также дается начальный Goroutine (G). Goroutine — это, по сути, Coroutine, но это Go, поэтому мы заменяем букву C на G и получаем слово Goroutine. Вы можете думать о Goroutines как о потоках уровня приложения, и они во многом похожи на потоки ОС. Так же, как потоки ОС включаются и выключаются ядром, контекстные программы включаются и выключаются контекстом.

Последний пазл — это очереди на выполнение. В планировщике Go есть две разные очереди выполнения: глобальная очередь выполнения (GRQ) и локальная очередь выполнения (LRQ). Каждому P присваивается LRQ, который управляет горутинами, назначенными для выполнения в контексте P. Эти горутины по очереди включаются и выключаются из контекста M, назначенного для этого P. GRQ предназначен для горутин, которые не были назначены для P. Существует процесс, чтобы переместить горутины из GRQ в LRQ, который мы обсудим позже.

На рисунке представлено изображение всех этих компонентов вместе.

Blocked Goroutine during Channel Operation.

Each Channel has a recvq (waitq) that is used to store the blocked goroutines which are trying to read data from the channel.

Unblocked goroutine after channel operation is put into Run queue by the channel itself.

First, let’s look into the blocking system call. A system call that blocks the underlying kernel thread, so we cannot Schedule any other Goroutine on this thread.

Implication Blocking System Call reduces the Parallelism level.

Cannot Schedule any other Goroutine on M2 thread, resulting in CPU waste, as we have work to do but we are not running it.

The way we can restore the parallelism level is while we are entering into system call, we can wake up another thread, which will pick runnable goroutine from the run queue.

But now we have Oversubscribed Scheduling when the system call is finished. To avoid we will not run the Goroutine returning from blocking system call instantly. But we will put it into our scheduler run queue.

So the number of threads greater than number of cores when our program runs. Though not explicitly stated the number of threads is greater than the number of cores and all Idle threads are managed by runtime too, to avoid too many threads.

The initial setting is 10,000 threads, the program will crash if it exceeds.

Non-Blocking System Call — Blocks the goroutine on the Integrated runtime poller, and the thread is released to run another goroutine.

part implementation of net.Read function.

Once the first syscall is done and explicitly says the resource is not yet ready, the goroutine will park until the network poller notifies it that the resource is now ready. In this case, thread M will not be blocked.

Poller will use select / kqueue / epoll / IOCP based on the operating system to know which file descriptor is ready and will put the goroutine back on the run queue as soon as the file descriptor is ready for reading or writing.

There is also a Sysmon OS thread that will periodically poll network if it’s not polled for more than 10ms and will add the ready G to the queue.

Basically all of the goroutines blocked for operation on

Have some sort of queue, that helps in scheduling these goroutines.

Практический пример

Вещи на поверхности, кажется, не отличаются. Все те же изменения контекста и изменения состояния происходят независимо от того, используете ли вы Потоки или Горутины. Однако существует большая разница между использованием Потоков и Горутин, которая может быть неочевидна на первый взгляд.

В случае использования горутин одни и те же потоки ОС и ядро используются для всей обработки. Это означает, что с точки зрения ОС Поток ОС никогда не переходит в состояние ожидания; ни разу. В результате все те инструкции, которые мы потеряли при переключении контекста, при использовании потоков, не теряются при использовании горутин.

По сути, Go превратил работу IO / Blocking в работу с привязкой к процессору на уровне ОС. Поскольку все переключение контекста происходит на уровне приложения, мы не теряем те же самые ~ 12 тыс. инструкций (в среднем) на переключение контекста, которые мы теряли при использовании потоков. В Go те же переключатели контекста стоят вам ~ 200 наносекунд или ~ 2,4 тыс. команд. Планировщик также помогает повысить эффективность строк кэширования и NUMA. Вот почему нам не нужно больше потоков, чем у нас есть виртуальные ядра. В Go можно со временем выполнять больше работы, потому что планировщик Go пытается использовать меньше потоков и делать больше на каждом потоке, что помогает снизить нагрузку на ОС и оборудование.

Асинхронные системные вызовы

Когда операционная система, на которой вы работаете, имеет возможность обрабатывать системный вызов асинхронно, то, что называется network poller, может использоваться для более эффективной обработки системного вызова. Это достигается с помощью kqueue (MacOS), epoll (Linux) или iocp (Windows) в этих соответствующих ОС.

Сетевые системные вызовы могут обрабатываться асинхронно многими операционными системами, которые мы используем сегодня. Именно здесь network poller показывает себя, поскольку его основное назначение — обработка сетевых операций. Используя network poller для сетевых системных вызовов, планировщик может запретить горутинам блокировать M при выполнении этих системных вызовов. Это помогает держать M доступным для выполнения других горутин в LRQ P без необходимости создавать новые M. Это помогает уменьшить нагрузку планирования в ОС.

Лучший способ увидеть, как это работает — просмотреть пример. На рисунке показана наша базовая схема планирования. Горутина-1 выполняется на M, и еще 3 Горутины ждут в LRQ, чтобы получить свое время на M. Network poller бездействует, и ему нечего делать.

На следующем рисунке Горутина-1 (G1) хочет выполнить сетевой системный вызов, поэтому G1 перемещается в Network poller и обрабатывается как асинхронный сетевой системный вызов. Как только G1 перемещена в Network poller, M теперь доступен для выполнения другой горутины из LRQ. В этом случае Горутина-2 переключается на M.

На следующем рисунке системный сетевой вызов завершается асинхронным сетевым вызовом, и G1 перемещается обратно в LRQ для P. После того, как G1 может быть переключен обратно на M, код, связанный с Go, за который он отвечает может выполнить снова. Большой выигрыш в том, что для выполнения сетевых системных вызовов не требуется никаких дополнительных Ms. У Network poller есть поток ОС, и он обрабатывает через event loop.

With that Go, Runtime has a Scheduler with all of the required functionality.

It can handle Parallel Execution (Multiple threads).
Handles Blocking System call and network I/O.
Handles Blocking User level (on channel) calls.
Scalable.
Efficient.
Fair.

That offers massive concurrency and always tries to achieve maximum utilization, minimum latencies.

Now that we have a fair bit of idea about Go runtime scheduler in general, how can we use it?. Go provide us with a trace tool, scheduler trace especially to provide insights about the behavior and debug the scalability issues related to the goroutine scheduler.

Run your Go program with the GODEBUG=schedtrace=DURATION environment variable to enables the scheduler trace. (DURATION is the period of output in ms.)

Привет, Хабр! Это второй пост в серии из трех частей, которая даст представление о механике и семантике работы планировщика в Go. Этот пост посвящен планировщику Go.

В первой части этого цикла я объяснил аспекты планировщика операционной системы, которые, на мой взгляд, важны для понимания и оценки семантики планировщика Go. В этом посте я объясню на семантическом уровне, как работает планировщик Go. Планировщик Go — сложная система, и мелкие механические детали не важны. Важно иметь хорошую модель того, как все работает и ведет себя. Это позволит вам принимать лучшие инженерные решения.

Вывод

Планировщик Go действительно удивителен тем, как он учитывает тонкости работы ОС и оборудования. Возможность превратить работу ввода-вывода / блокировки в работу с процессором с привязкой к процессору на уровне операционной системы — это то, где мы получаем большой выигрыш в использовании большей мощности процессора с течением времени. Вот почему вам не нужно больше потоков ОС, чем у вас есть виртуальные ядра. Вы можете разумно ожидать, что вся ваша работа будет выполнена (с привязкой к процессору и вводу-выводу / блокировкам) с одним потоком ОС на виртуальное ядро. Это возможно для сетевых приложений и других приложений, которым не нужны системные вызовы, блокирующие потоки ОС.

Как разработчик, вы все равно должны понимать, что делает ваше приложение с точки зрения типа работы. Вы не можете создавать неограниченное количество горутин и ожидать потрясающую производительность. Меньше — всегда больше, но с пониманием этой семантики Go-планировщика вы можете принимать лучшие инженерные решения.

Среда исполнения (runtime) предоставляет пользователям статистику и отчеты о внутренних событиях для диагностики проблем производительности и использования на уровне времени исполнения.

Пользователи могут отслеживать эту статистику, чтобы лучше понять общее состояние и производительность программ Go. Некоторые часто отслеживаемые характеристики и состояния:

runtime.ReadMemStats сообщает о метриках, связанных с выделением кучи и сборкой мусора. Статистика памяти полезна для мониторинга того, сколько ресурсов памяти потребляет процесс, может ли процесс эффективно использовать память, и выявлять утечки памяти.
debug.ReadGCStats читает статистику о сборке мусора. Полезно посмотреть, сколько ресурсов расходуется на паузы в GC. Он также сообщает временную шкалу пауз сборщика мусора и процентили времени паузы.
debug.Stack возвращает текущую трассировку стека. Трассировка стека полезна, чтобы увидеть, сколько в настоящий момент запущено goroutines, что они делают, и заблокированы они или нет.
debug.WriteHeapDump приостанавливает выполнение всех goroutines и позволяет выгрузить кучу в файл. Дамп кучи - это снимок памяти процесса Go в данный момент времени. Он содержит все выделенные объекты, а также процедуры, финализаторы и многое другое.
runtime.NumGoroutine возвращает количество текущих goroutines. Это значение можно отслеживать, чтобы увидеть, достаточно ли используется goroutines, или для обнаружения утечек goroutines.

Трассировщик исполнения

Go поставляется с трассировщиком исполнения времени исполнения для захвата широкого спектра событий во время исполнения. Планирование, системный вызов, сборка мусора, размер кучи и другие события собираются во время исполнения и доступны для визуализации go tool trace. Трассировщик исполнения является инструментом для выявления задержек и проблем с использованием ресурсов. Вы можете проверить, насколько эффективно используется ЦП, и когда сетевые или системные вызовы являются причиной прерывания для goroutines.

Трассировщик полезен для:

Понимания, как выполняются ваши программы.
Ознакомления с некоторыми основными событиями среды исполнения, такими как GC.
Выявления плохо распараллеленного исполнения.

Тем не менее, он не подходит для выявления горячих точек, таких как анализ причин чрезмерного использования памяти или ЦП. Вместо этого сначала используйте инструменты профилирования для их решения.

Выше визуализация go tool trace показывает, что выполнение началось нормально, а затем оно стало сериализованным. Это предполагает, что может быть конфликт блокировки для общего ресурса, который создает узкое место.

GODEBUG

Среда исполнения также генерирует события и информацию, если переменная среды GODEBUG установлена соответствующим образом.

GODEBUG=gctrace=1 печатает события сборщика мусора в каждой коллекции, суммируя объем собранной памяти и продолжительность паузы.
GODEBUG=schedtrace=X печатает события планирования каждые X миллисекунд.

Переменная окружения GODEBUG может использоваться для отключения использования расширений набора команд в стандартной библиотеке и библиотеки времени исполнения (runtime).

В версии, предшествующей выпуску go 1.5 веб-сайта Tour of Go , есть фрагмент кода, который выглядит следующим образом.

Результат выглядит так:

Меня беспокоит то, что при runtime.Gosched() удалении программа перестает печатать "мир".

Почему это так? Как runtime.Gosched() влияет на исполнение?

Заметка:

Когда вы запускаете программу Go без указания переменной среды GOMAXPROCS, горутины Go планируются для выполнения в одном потоке ОС. Однако, чтобы программа выглядела многопоточной (для чего нужны горутины, не так ли?), Планировщик Go должен иногда переключать контекст выполнения, чтобы каждая горутина могла выполнять свою часть работы.

Как я уже сказал, когда переменная GOMAXPROCS не указана, среде выполнения Go разрешено использовать только один поток, поэтому невозможно переключать контексты выполнения, пока goroutine выполняет некоторую обычную работу, например вычисления или даже ввод-вывод (который отображается на простые функции C. ). Контекст может переключаться только тогда, когда используются примитивы параллелизма Go, например, когда вы включаете несколько каналов, или (это ваш случай), когда вы явно указываете планировщику переключать контексты - это то runtime.Gosched , для чего.

Короче говоря, когда контекст выполнения в одной горутине достигает Gosched вызова, планировщик получает указание переключить выполнение на другую горутину. В вашем случае есть две горутины, основная (которая представляет «основной» поток программы) и дополнительная, которую вы создали go say . Если вы удалите Gosched вызов, контекст выполнения никогда не будет перенесен из первой горутины во вторую, поэтому для вас не будет «мира». Когда Gosched присутствует, планировщик передает выполнение на каждой итерации цикла от первой горутины ко второй и наоборот, так что у вас чередуются «привет» и «мир».

К вашему сведению, это называется «кооперативная многозадачность»: горутины должны явно передавать управление другим горутинам. Подход, используемый в большинстве современных операционных систем, называется «вытесняющей многозадачностью»: потоки выполнения не связаны с передачей управления; планировщик прозрачно переключает им контексты выполнения. Кооперативный подход часто используется для реализации «зеленых потоков», то есть логических параллельных сопрограмм, которые не отображают 1: 1 в потоки ОС - так реализуется среда выполнения Go и ее горутины.

Обновить

Если для этой переменной установлено положительное число N , среда выполнения Go сможет создавать до N собственных потоков, для которых будут запланированы все зеленые потоки. Родной поток - это своего рода поток, который создается операционной системой (потоки Windows, pthreads и т. Д.). Это означает, что если N больше 1, возможно, что горутины будут запланированы для выполнения в разных собственных потоках и, следовательно, будут выполняться параллельно (по крайней мере, с учетом возможностей вашего компьютера: если ваша система основана на многоядерном процессоре, она вполне вероятно, что эти потоки будут по-настоящему параллельными; если ваш процессор имеет одноядерный процессор, то вытесняющая многозадачность, реализованная в потоках ОС, создаст видимость параллельного выполнения).

Можно установить переменную GOMAXPROCS с помощью runtime.GOMAXPROCS() функции вместо предварительной установки переменной среды. Используйте в своей программе что-то подобное вместо текущего main :

В этом случае можно наблюдать интересные результаты. Возможно, что вы получите напечатанные строки hello и world с неравномерным чередованием, например

Это может произойти, если горутины запланированы для разделения потоков ОС. Фактически, именно так работает вытесняющая многозадачность (или параллельная обработка в случае многоядерных систем): потоки параллельны, а их комбинированный вывод является неопределенным. Кстати, вы можете оставить или удалить Gosched вызов, похоже, это не действует, когда GOMAXPROCS больше 1.

Вот что я получил при нескольких запусках программы с runtime.GOMAXPROCS call.

Видите ли, иногда результат бывает красивым, иногда нет. Индетерминизм в действии :)

Еще одно обновление

Похоже, что в более новых версиях компилятора Go среда выполнения Go заставляет горутины уступать не только при использовании примитивов параллелизма, но и при системных вызовах ОС. Это означает, что контекст выполнения можно переключать между горутинами также при вызове функций ввода-вывода. Следовательно, в последних компиляторах Go можно наблюдать недетерминированное поведение, даже когда GOMAXPROCS не установлен или установлен в 1.

This blog post explains the structure of a Go-based project, dives into the Go compiler, and overviews a Go node.

This series of blog posts is intended for those who are already familiar with the basics of Go and would like to get a deeper insight into its internals. This tutorial is dedicated to the structure of the Go source code and some internal details of the Go compiler. After reading this, you should be able to answer the following questions:

What is the structure of the Go source code?
How does the Go compiler work?
What is the basic structure of a node tree in Go?

What happens Once blocking syscall exits?

Runtime tries to acquire the exact same P, and resume the execution.
Runtime tries to acquire a P in the idle list and resume the execution.
Runtime put the goroutine in the global queue and put the associated M back to the idle list.

Spinning Thread and Ideal Thread.

When M2 thread becomes ideal after syscall returns. What to do with this ideal M2 thread. Theoretically, a thread should be destroyed by the OS if it finishes what it needs to do, and then threads in other processes may be scheduled for execution by the CPU. This is what we often call “preemptive scheduling” of threads in an operating system.

Consider the situation in the syscall above. If we destroy the M2 thread and M3 thread is about to enter into the syscall. At this point, the runnable goroutines cannot be processed until a new Kernel Thread is created and is scheduled to be executed by the OS. Frequent pre-thread preemption operations not only increase the load on the OS but are almost unacceptable for programs with higher performance requirements.

So to properly have the resource utilization of the OS and prevent the frequent thread preemption of the load on the OS, we will not destroy the Kernel Thread M2, instead, it will take a spin operation and save itself for further use. Although it seems that this is a waste of some resources. But when compared with frequent preemption between threads and frequent create and destroy operation “ideal thread” is still very less price we will pay.

Spinning Thread — For example, in a Go program with One kernel thread M (1) and One logical processor (P), if the M being executed is being blocked by the syscall, then the same number of “Spinning Threads” as the number of P is required to allow the waiting runnable goroutine to continue executing. Therefore, during this period, the number of kernel threads M is more than the number of P (a Spinning Thread + a blocked thread). So even when runtime.GOMAXPROCS value is set to 1, the program will be in a multi-threaded state.

Like many other schedulers, Go too has a fairness constraint and is imposed by the implementation on the goroutines because Runnable goroutine should run eventually

Here are four typical fairness constraints in Go Runtime Scheduler.

Any goroutine running for more than 10ms is marked as preemptible (soft limit). But, the preemption is only done at the function prolog. Go currently uses compiler-inserted cooperative preemption points in function prologues.

Infinite loop — preemption (~10ms time slice) — soft limit

But be cautious with an infinite loop as Go’s scheduler is not preemptive (till 1.13). If loops don’t contain any preemption points (like function calls, or allocate memory), they will prevent other goroutines from running. A simple example is:

If you run with:

It may never print the statement until Go(1.13). Due to a lack of preemption points, main Goroutines can hog the processor.

Local Run queue — preemption (~10ms time slice) — soft limit
Global run queue starvation is avoided by checking the global run queue for every 61 scheduler tick.
Network Poller Starvation Background thread poll network occasionally if not polled by the main worker thread.

Now runtime has a Scheduler with the following functionality.

It can handle Parallel Execution (Multiple threads).
Handles Blocking System call and network I/O.
Handles Blocking User level (on channel) calls.

As you can see we have a Global run queue with Mutex, we will end up with some Issues like

Overhead of cache coherency guarantees.
Fierce Lock Contention while creating, destroying and scheduling Goroutine G.

Overcoming the Scalability problem with Distributed Scheduler.

With this, the immediate benefits we can see is we have now no mutex for each thread-local run queue. Still have a global run queue with a mutex, used in special cases. It doesn’t affect scalability.

Scheduling Cost.

So it’s more efficient to use one process that contains multiple threads since process creation is time-consuming and resource-intensive. But then Multithreaded problem, appeared: The C10k Problem being the major one.

For example, if you define the scheduler period as 10ms (milliseconds) and you have 2 threads, each thread will get 5ms separately. If you have 5 threads, then each thread gets 2ms. But what happens if there are 1000 threads? Give each thread a time slice of 10 μs (microseconds)? Wrong, it’s stupid to do this, because you will spend a lot of time on context switching, but the real work cannot be done.

You need to limit the length of the time slice. In the last scenario, if the minimum time slice is 2ms and there are 1000 threads, the scheduler cycle needs to be increased to 2s (seconds). If there are 10,000 threads, the scheduler cycle is 20 sec. In this simple example, if each thread uses its full-time slice, it takes 20 sec for all threads to run at once. So we need something that can make our concurrency cheaper, without suffering from too much overhead.

Inside the Go compiler

As we mentioned above, the architecture-independent part of the Go compiler is located in the /src/cmd/gc/ folder. The entry point is located in the lex.c file. Apart from some common stuff, such as parsing command line arguments, the compiler does the following:

Initializes some common data structures.
Iterates through all of the provided Go files and calls the yyparse method for each file. This causes actual parsing to occur. The Go compiler uses Bison as the parser generator. The grammar for the language is fully described in the go.y file (I will provide more details on it later). As a result, this step generates a complete parse tree where each node represents an element of the compiled program.
Recursively iterates through the generated tree several times and applies some modifications, e.g., defines type information for the nodes that should be implicitly typed, rewrites some language elements—such as typecasting—into calls to some functions in the runtime package and does some other work.
Performs the actual compilation after the parse tree is complete. Nodes are translated into assembler code.
Creates the object file that contains generated assembly code with some additional data structures, such as the symbols table, which is generated and written to the disk.

Understanding a project structure

If you look at the /src folder of the Go repository, you can see a lot of folders. Most of them contain source files of the standard Go library. The standard naming conventions are always applied here, so each package is inside a folder with a name that directly corresponds to the package name. Apart from the standard library, there is a lot of other stuff. In our opinion, the most important and useful folders are listed in the table below.

Folder	Description
/src/cmd/	Contains different command line tools.
/src/cmd/go/	Contains source files of a Go tool that downloads and builds Go source files and installs packages. While doing this, it collects all source files and makes calls to the Go linker and Go compiler command line tools.
/src/cmd/dist/	Contains a tool responsible for building all other command line tools and all the packages from the standard library. You may want to analyze its source code to understand what libraries are used in every particular tool or package.
/src/cmd/gc/	This is the architecture-independent part of the Go compiler.
/src/cmd/ld/	The architecture-independent part of the Go linker. Architecture-dependent parts are located in the folder with the “l” postfix that uses the same naming conventions as the compiler.
/src/cmd/5a/, 6a, 8a, and 9a	Here you can find Go assembler compilers for different architectures. The Go assembler is a form of assembly language that does not map precisely to the assembler of the underlying machine. Instead, there is a distinct compiler for each architecture that translates the Go assembler to the machine’s assembler. You can find more details in the official documentation.
/src/lib9/, /src/libbio, /src/liblink	Different libraries that are used inside the compiler, linker, and runtime package.
/src/runtime/	The most important Go package that is indirectly included into all programs. It contains the entire runtime functionality, such as memory management, garbage collection, goroutines creation, etc.

Simple M:N Scheduler

In our simple M:N Scheduler we have a global run queue, Some operation puts a new goroutine into the run queue. M Kernel threads access the scheduler to gets goroutine to run from “run queue”. Multiple threads are attempting to access the same area of memory, we will lock this structure with Mutex For Memory Access Synchronization.

Some instances when a goroutine can block.

Sending and Receiving on Channel.
Network I/O.
Blocking System Call.
Timers.
Mutexes.

So where do we put these blocked goroutines? — The design decision where to put these blocked goroutines basically revolves around one fundamental principle-

Blocked goroutine should not block the underlying kernel thread! (to avoid the thread context switch cost )

User Level Threads

Threads managed entirely by the run-time system (user-level library).
Ideally, Fast and efficient: switching threads not much more expensive than a function call.
The kernel knows nothing about user-level threads and manages them as if they were single-threaded processes.

In Go, we know it by the name of “Goroutine” (logically)

Understanding nodes

Now it is time to take a look at what a node actually is. First of all, a node is a struct (you can find a definition here). This struct contains a large number of properties, since it needs to support different kinds of nodes and different nodes have different attributes. Below is a description of several fields that I think are important to understand.

Node struct field	Description
op	Node operation. Each node has this field. It distinguishes different kinds of nodes from each other. In our previous example, those were OTFUNC (operation type function) and ODCLFUNC (operation declaration function).
type	This is a reference to another struct with type information for nodes that have type information (there are no types for some nodes, e.g., control flow statements, such as if , switch , or for ).
val	This field contains the actual values for nodes that represent literals.

Now that you understand the basic structure of the node tree, you can put your knowledge into practice. In the next post, we will investigate what exactly the Go compiler generates, using a simple Go application as an example.

Problem Outline.

How to distribute these multiple goroutines over multiple OS threads that run on the available CPU processors.
In what order these multiple goroutines should run to maintain fairness?

The rest of the discussion will mostly around solving these problems specific with Go runtime scheduler from a design perspective. But as with all problems, our domain also needs a well-defined boundary to deal with. Otherwise, the problem statement can be too vague for conclusive discussion. A scheduler may aim at one or more of many goals, for our case we will limit ourselves to the following requirement.

Should be Parallel and Scalable and Fair.
Should be Scalable to millions of goroutine per process (10⁶)
Memory Efficient. (RAM is cheap, but not free.)
System calls should not cause performance degradation. (maximizing throughput, minimizing wait time)

So let’s start modeling our scheduler to solve these problems in incremental steps.

— User Level Threading.

Parallel and Scalable.
* Parallel (yes)
* Scalable (Not really)
Is Not Scalable to millions of goroutine per process (10⁶).

— Hybrid Threading

M kernel threads to execute N “goroutine”

A kernel thread is needed for the actual execution of code and parallelism. But it’s expensive to create, So we map N goroutines to M Kernel Thread. Goroutine is the Go Code, so we have full control over it. Also, it’s in the user-space so it is cheap to create.

But as OS doesn’t know anything about the goroutine. Every goroutine has a state to help Scheduler knows which goroutine to run based on goroutine state. This state information is small as compared to the kernel threads, the context switching of goroutine becomes very fast.

Running — goroutine currently running on kernel thread.
Runnable — goroutine waiting for kernel thread to run.
Blocked — Goroutines waiting for some conditions (e.g. blocked on a channel, syscall, mutex, etc.)

So a Go Runtime Scheduler manages these goroutines at various states, by Multiplexing N Goroutine to M Kernel Thread.

By now Go runtime has a Scheduler with the following functionality.

It can handle Parallel Execution (Multiple threads).
Handles Blocking System call and network I/O.
Handles Blocking User level (on channel) calls.
Scalable.

Remember the way we restored the parallelism level in the blocking system call?

And its implication is we can have multiple kernel thread (Can be 10 or 1000) in a system call, which can be greater the number of cores. We end up with a Constant overhead during:

Overcoming the efficiency problem with M:P:N Threading.

P — represents the processor, which can be seen as a local scheduler running on a thread;

Logical Process P is always fixed in number. (default to logical CPUs usable by the current process)

And we put our local run queue (LRQ) inside the fixed number of logical Processors (P).

Go runtime will first create the fixed number of logical processor P based on the number of logical CPUs of the machine (or as requested).

And each goroutine (G) will run on an OS thread (M) that is assigned to a logical CPU (P).

So Now we have No Constant overhead during:

Work stealing — just have to scan a fixed number of a logical processor (P) local run queue.
Garbage Collection, Memory allocator also gain the same benefits.

Go optimizes the system calls — whatever it is blocking or not — by wrapping them up in the runtime.

The Blocking SYSCALL method is encapsulated between runtime.entersyscall(SB)
runtime.exitsyscall(SB)
In a literal sense, some logic is executed before entering the system call, and some logic is executed after exiting the system call. This wrapper will automatically dissociate the P from the thread M when a blocking system call is made and allow another thread to run on it.

This allows Go runtime to handle the blocking system call efficiently without increasing the run queue.

Getting started

When you start learning a new programming language, you can usually find a lot of “hello-world” tutorials, beginner guides, and books with details on the main language concepts, syntax, and even the standard library. However, getting information on such things as the layout of major data structures that the language runtime allocates or what assembly code is generated when you call a built-in function is not that easy. Obviously, the answers lie inside the source code, but, from our own experience, you can spend hours wandering through it without making much progress. The goal of this article is to demonstrate how you can decipher Go sources on your own.

Before we can begin, we certainly need our own copy of the Go source files. There is nothing special in getting them. Just execute the command below.

Please note that the code in the main branch is being constantly changed, so we use the release-branch.go1.4 branch in this blog post.

Переключение контекста

Планировщик Go требует четко определенных событий в пространстве пользователя, которые происходят в безопасных точках кода, для переключения контекста. Эти события и безопасные точки проявляются в вызовах функций. Вызовы функций имеют решающее значение для работоспособности планировщика Go. Если вы выполняете какие-либо узкие циклы, которые не выполняют вызовы функций, вы будете вызывать задержки в планировщике и сборке мусора. Крайне важно, чтобы вызовы функций происходили в разумные сроки.

Существует четыре класса событий, которые происходят в ваших программах Go, которые позволяют планировщику принимать решения по планированию. Это не значит, что это всегда будет происходить на одном из этих событий. Это означает, что планировщик получает возможность.

Использование ключевого слова go
Cборщик мусора
Системные вызовы
Синхронизация

Cборщик мусора (GC)
Так как GC работает с собственным набором горутин, эти горутины нуждаются во времени на М для запуска. Это заставляет GC создавать много хаоса в планировании. Тем не менее, планировщик очень умен в том, что делает горутина, и он будет использовать это для принятия решений. Одним из разумных решений является переключение контекста на горутину, которая хочет обратиться к системному ресурсу, и никто больше кроме нее, во время сборки мусора. Когда GC работает, принимается много решений по планированию.

Системные вызовы
Если горутина делает системный вызов, который заставит ее заблокировать M, планировщик может переключать контекст на другую горутину, на тот же M.

Синхронизация
Если вызов атомарной операции, мьютекса или канала вызовет блокировку горутины, планировщик может переключить контекст для запуска новой горутины. Как только горутина может работать снова, она может быть поставлена в очередь и в конечном итоге переключиться обратно на M.

Читайте также: