Golang Program Startup
Environment

System Architecture | Go Version
---|---
Amd64 | Go 1.18
The main Function

The main function is the program's entry point; it initializes the program and runs its core logic. In Go, main is special: when the program starts, the runtime executes it as a goroutine (the main goroutine) through the scheduler. That goroutine is scheduled on a specific GMP trio, normally made up of g0, p0 and m0. If we treat main as the dividing line of program execution, these structures are not created by the runtime after main starts: this special scheduling trio is initialized by assembly code before the Go program is ready to schedule the main goroutine, prepared specifically for running main. Any other goroutines started from main are scheduled on the GMP structures managed by the runtime.
Why main Runs as a Goroutine

By running main as a goroutine, Go executes it with the same machinery it uses for all concurrency and asynchronous work. Functions launched from main run as goroutines on different GMP units and execute concurrently, which is exactly what enables concurrent programming.
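As a minimal illustration (ordinary user code, nothing runtime-internal assumed): goroutines started inside main are scheduled on the regular GMP machinery while the main goroutine itself waits.

go
package main

import (
	"fmt"
	"sync"
)

func main() { // runs as the main goroutine, started by runtime.main
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) { // each go statement creates a new g via runtime.newproc
			defer wg.Done()
			fmt.Println("worker", id)
		}(i)
	}
	wg.Wait() // main blocks here while the workers run on the Ps
}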
Assembly Code

Stage 1: Initialize g0 and m0

This stage initializes g0 and m0 and makes them reference each other.

asm
TEXT runtime·rt0_go(SB),NOSPLIT|TOPFRAME,$0
	// copy arguments forward on an even stack
	MOVQ	DI, AX		// argc
	MOVQ	SI, BX		// argv
	SUBQ	$(5*8), SP	// 3args 2auto
	ANDQ	$~15, SP
	MOVQ	AX, 24(SP)
	MOVQ	BX, 32(SP)

	// Initialize g0 here and carve out its stack space.
	// create istack out of the given (operating system) stack.
	// _cgo_init may update stackguard.
	MOVQ	$runtime·g0(SB), DI
	LEAQ	(-64*1024+104)(SP), BX
	MOVQ	BX, g_stackguard0(DI)
	MOVQ	BX, g_stackguard1(DI)
	MOVQ	BX, (g_stack+stack_lo)(DI)
	MOVQ	SP, (g_stack+stack_hi)(DI)

	// ... CPU feature detection code omitted ...

	// Initialize m0's TLS: DI = &m0.tls (address of m0's tls array)
	LEAQ	runtime·m0+m_tls(SB), DI
	// settls installs thread-local storage; afterwards m0.tls is
	// reachable through the FS segment register.
	CALL	runtime·settls(SB)
	// store through it, to make sure it works
	// get_tls is compiler-generated: load the FS base (the address of
	// m0.tls[1]) into BX.
	get_tls(BX)
	// g(BX) is compiler-generated as well: it addresses m0.tls[0] (FS-8).
	MOVQ	$0x123, g(BX)
	// AX = m0.tls[0]; check that thread-local storage actually works.
	MOVQ	runtime·m0+m_tls(SB), AX
	CMPQ	AX, $0x123
	// If the values match, skip the next instruction (the abort).
	JEQ	2(PC)
	CALL	runtime·abort(SB)
ok:
	// set the per-goroutine and per-mach "registers"
	// BX = FS base, i.e. the address of m0.tls[1]
	get_tls(BX)
	// CX = &g0
	LEAQ	runtime·g0(SB), CX
	// m0.tls[0] = &g0
	MOVQ	CX, g(BX)
	// AX = &m0
	LEAQ	runtime·m0(SB), AX

	// Here g0 and m0 are made to reference each other.
	// save m->g0 = g0: m0.g0 = &g0
	MOVQ	CX, m_g0(AX)
	// save m0 to g0->m: g0.m = &m0
	MOVQ	AX, g_m(CX)
	// chain: FS -> tls[1] -> g() -> tls[0] -> g0; then g0.m = &m0 and m0.g0 = &g0
	CLD	// convention is D is always left cleared

	// ... (further setup omitted) ...
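In Go terms, the net effect of this stage looks roughly like the sketch below. The struct fields mirror runtime.g and runtime.m, but the types and the function are illustrative stand-ins, not the real runtime definitions.

go
package main

import "unsafe"

// Illustrative stand-ins for runtime.g and runtime.m (simplified).
type stack struct{ lo, hi uintptr }

type g struct {
	stack stack
	m     *m
}

type m struct {
	g0  *g
	tls [6]uintptr // tls[0] holds the current g, published via settls
}

var (
	g0 g
	m0 m
)

// rt0goSketch restates what the stage-one assembly does.
func rt0goSketch(sp uintptr) {
	g0.stack.hi = sp
	g0.stack.lo = sp - 64*1024 + 104 // mirrors LEAQ (-64*1024+104)(SP), BX
	m0.tls[0] = uintptr(unsafe.Pointer(&g0)) // m0.tls[0] = &g0
	m0.g0 = &g0                              // m->g0 = g0
	g0.m = &m0                               // g0->m = m0
}

func main() {
	rt0goSketch(0x7fff_0000) // a made-up stack pointer, for illustration only
}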
Stage 2: Build the Environment

This stage initializes the runtime environment: program startup arguments, the global variable ncpu (the number of CPU cores), the Ps, the heap allocator, the stack allocator, and the garbage collector.

asm
	CALL	runtime·check(SB)

	MOVL	24(SP), AX	// copy argc
	MOVL	AX, 0(SP)
	MOVQ	32(SP), AX	// copy argv
	MOVQ	AX, 8(SP)
	// process the startup arguments
	CALL	runtime·args(SB)
	// osinit: determine the number of CPU cores (sets the global ncpu)
	CALL	runtime·osinit(SB)
	// schedinit: initialize the scheduler
	CALL	runtime·schedinit(SB)
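The ncpu value that osinit records is the same one user code later observes through runtime.NumCPU, for example:

go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// runtime.NumCPU reports the core count the runtime detected at startup.
	fmt.Println("CPU cores seen by the runtime:", runtime.NumCPU())
}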
runtime·schedinit
go
// The bootstrap sequence is:
//
// call osinit
// call schedinit
// make & queue new G
// call runtime·mstart
//
// The new G calls runtime·main.
func schedinit() {
lockInit(&sched.lock, lockRankSched)
lockInit(&sched.sysmonlock, lockRankSysmon)
lockInit(&sched.deferlock, lockRankDefer)
lockInit(&sched.sudoglock, lockRankSudog)
lockInit(&deadlock, lockRankDeadlock)
lockInit(&paniclk, lockRankPanic)
lockInit(&allglock, lockRankAllg)
lockInit(&allpLock, lockRankAllp)
lockInit(&reflectOffs.lock, lockRankReflectOffs)
lockInit(&finlock, lockRankFin)
lockInit(&trace.bufLock, lockRankTraceBuf)
lockInit(&trace.stringsLock, lockRankTraceStrings)
lockInit(&trace.lock, lockRankTrace)
lockInit(&cpuprof.lock, lockRankCpuprof)
lockInit(&trace.stackTab.lock, lockRankTraceStackTab)
// Enforce that this lock is always a leaf lock.
// All of this lock's critical sections should be
// extremely short.
lockInit(&memstats.heapStats.noPLock, lockRankLeafRank)
// raceinit must be the first call to race detector.
// In particular, it must be done before mallocinit below calls racemapshadow.
_g_ := getg()
if raceenabled {
_g_.racectx, raceprocctx0 = raceinit()
}
sched.maxmcount = 10000
// The world starts stopped.
worldStopped()
moduledataverify()
stackinit()
mallocinit()
cpuinit() // must run before alginit
alginit() // maps, hash, fastrand must not be used before this call
fastrandinit() // must run before mcommoninit
mcommoninit(_g_.m, -1)
modulesinit() // provides activeModules
typelinksinit() // uses maps, activeModules
itabsinit() // uses activeModules
stkobjinit() // must run before GC starts
sigsave(&_g_.m.sigmask)
initSigmask = _g_.m.sigmask
if offset := unsafe.Offsetof(sched.timeToRun); offset%8 != 0 {
println(offset)
throw("sched.timeToRun not aligned to 8 bytes")
}
goargs()
goenvs()
parsedebugvars()
gcinit()
lock(&sched.lock)
sched.lastpoll = uint64(nanotime())
procs := ncpu
if n, ok := atoi32(gogetenv("GOMAXPROCS")); ok && n > 0 {
procs = n
}
if procresize(procs) != nil {
throw("unknown runnable goroutine during bootstrap")
}
unlock(&sched.lock)
// World is effectively started now, as P's can run.
worldStarted()
// For cgocheck > 1, we turn on the write barrier at all times
// and check all pointer writes. We can't do this until after
// procresize because the write barrier needs a P.
if debug.cgocheck > 1 {
writeBarrier.cgo = true
writeBarrier.enabled = true
for _, p := range allp {
p.wbBuf.reset()
}
}
if buildVersion == "" {
// Condition should never trigger. This code just serves
// to ensure runtime·buildVersion is kept in the resulting binary.
buildVersion = "unknown"
}
if len(modinfo) == 1 {
// Condition should never trigger. This code just serves
// to ensure runtime·modinfo is kept in the resulting binary.
modinfo = ""
}
}
// --------------------------------- initializing the Ps: procresize ---------------------------------
// Change number of processors.
//
// sched.lock must be held, and the world must be stopped.
//
// gcworkbufs must not be being modified by either the GC or the write barrier
// code, so the GC must not be running if the number of Ps actually changes.
//
// Returns list of Ps with local work, they need to be scheduled by the caller.
func procresize(nprocs int32) *p {
assertLockHeld(&sched.lock)
assertWorldStopped()
old := gomaxprocs
if old < 0 || nprocs <= 0 {
throw("procresize: invalid arg")
}
if trace.enabled {
traceGomaxprocs(nprocs)
}
// update statistics
now := nanotime()
if sched.procresizetime != 0 {
sched.totaltime += int64(old) * (now - sched.procresizetime)
}
sched.procresizetime = now
maskWords := (nprocs + 31) / 32
// Grow allp if necessary.
if nprocs > int32(len(allp)) {
// Synchronize with retake, which could be running
// concurrently since it doesn't run on a P.
lock(&allpLock)
if nprocs <= int32(cap(allp)) {
allp = allp[:nprocs]
} else {
nallp := make([]*p, nprocs)
// Copy everything up to allp's cap so we
// never lose old allocated Ps.
copy(nallp, allp[:cap(allp)])
allp = nallp
}
if maskWords <= int32(cap(idlepMask)) {
idlepMask = idlepMask[:maskWords]
timerpMask = timerpMask[:maskWords]
} else {
nidlepMask := make([]uint32, maskWords)
// No need to copy beyond len, old Ps are irrelevant.
copy(nidlepMask, idlepMask)
idlepMask = nidlepMask
ntimerpMask := make([]uint32, maskWords)
copy(ntimerpMask, timerpMask)
timerpMask = ntimerpMask
}
unlock(&allpLock)
}
// initialize new P's
for i := old; i < nprocs; i++ {
pp := allp[i]
if pp == nil {
pp = new(p)
}
pp.init(i)
atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
}
_g_ := getg()
if _g_.m.p != 0 && _g_.m.p.ptr().id < nprocs {
// continue to use the current P
_g_.m.p.ptr().status = _Prunning
_g_.m.p.ptr().mcache.prepareForSweep()
} else {
// release the current P and acquire allp[0].
//
// We must do this before destroying our current P
// because p.destroy itself has write barriers, so we
// need to do that from a valid P.
if _g_.m.p != 0 {
if trace.enabled {
// Pretend that we were descheduled
// and then scheduled again to keep
// the trace sane.
traceGoSched()
traceProcStop(_g_.m.p.ptr())
}
_g_.m.p.ptr().m = 0
}
_g_.m.p = 0
p := allp[0]
p.m = 0
p.status = _Pidle
acquirep(p)
if trace.enabled {
traceGoStart()
}
}
// g.m.p is now set, so we no longer need mcache0 for bootstrapping.
mcache0 = nil
// release resources from unused P's
for i := nprocs; i < old; i++ {
p := allp[i]
p.destroy()
// can't free P itself because it can be referenced by an M in syscall
}
// Trim allp.
if int32(len(allp)) != nprocs {
lock(&allpLock)
allp = allp[:nprocs]
idlepMask = idlepMask[:maskWords]
timerpMask = timerpMask[:maskWords]
unlock(&allpLock)
}
var runnablePs *p
for i := nprocs - 1; i >= 0; i-- {
p := allp[i]
if _g_.m.p.ptr() == p {
continue
}
p.status = _Pidle
if runqempty(p) {
pidleput(p)
} else {
p.m.set(mget())
p.link.set(runnablePs)
runnablePs = p
}
}
stealOrder.reset(uint32(nprocs))
var int32p *int32 = &gomaxprocs // make compiler check that gomaxprocs is an int32
atomic.Store((*uint32)(unsafe.Pointer(int32p)), uint32(nprocs))
return runnablePs
}
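schedinit defaults the P count to ncpu and lets the GOMAXPROCS environment variable override it; calling runtime.GOMAXPROCS later stops the world and runs procresize again with the new count. A small demonstration:

go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only queries the current P count chosen by schedinit.
	fmt.Println("Ps after schedinit:", runtime.GOMAXPROCS(0))

	// GOMAXPROCS(n) with n > 0 re-runs procresize with the new count
	// and returns the previous setting.
	prev := runtime.GOMAXPROCS(2)
	fmt.Println("previous:", prev, "now:", runtime.GOMAXPROCS(0))
}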
Stage 3: Start main

Create a g for runtime.main, put it on a run queue, and start scheduling.

asm
	// create a new goroutine to start program
	MOVQ	$runtime·mainPC(SB), AX	// entry: mainPC is the address of runtime.main
	PUSHQ	AX
	CALL	runtime·newproc(SB)
	POPQ	AX

	// start this M
	CALL	runtime·mstart(SB)

	CALL	runtime·abort(SB)	// mstart should never return
	RET
runtime·newproc
go
// newproc: the runtime entry point for creating a goroutine.
// Create a new g running fn.
// Put it on the queue of g's waiting to run.
// The compiler turns a go statement into a call to this.
func newproc(fn *funcval) {
gp := getg()
pc := getcallerpc()
systemstack(func() {
		// newproc1 returns a freshly initialized newg
newg := newproc1(fn, gp, pc)
		// fetch the current P and put newg on its local run queue
_p_ := getg().m.p.ptr()
runqput(_p_, newg, true)
if mainStarted {
wakep()
}
})
}
// Create a new g in state _Grunnable, starting at fn. callerpc is the
// address of the go statement that created this. The caller is responsible
// for adding the new g to the scheduler.
func newproc1(fn *funcval, callergp *g, callerpc uintptr) *g {
_g_ := getg()
if fn == nil {
_g_.m.throwing = -1 // do not dump full stacks
throw("go of nil func value")
}
acquirem() // disable preemption because it can be holding p in a local var
_p_ := _g_.m.p.ptr()
newg := gfget(_p_)
	// Try to reuse a dead g from the P's free list; if there is none,
	// allocate a new g with the minimum 2KB stack (_StackMin).
if newg == nil {
newg = malg(_StackMin)
casgstatus(newg, _Gidle, _Gdead)
allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
}
if newg.stack.hi == 0 {
throw("newproc1: newg missing stack")
}
if readgstatus(newg) != _Gdead {
throw("newproc1: new g is not Gdead")
}
totalSize := uintptr(4*goarch.PtrSize + sys.MinFrameSize) // extra space in case of reads slightly beyond frame
totalSize = alignUp(totalSize, sys.StackAlign)
sp := newg.stack.hi - totalSize
spArg := sp
if usesLR {
// caller's LR
*(*uintptr)(unsafe.Pointer(sp)) = 0
prepGoExitFrame(sp)
spArg += sys.MinFrameSize
}
memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
newg.sched.sp = sp
newg.stktopsp = sp
// +PCQuantum so that previous instruction is in same function
newg.sched.pc = abi.FuncPCABI0(goexit) + sys.PCQuantum
newg.sched.g = guintptr(unsafe.Pointer(newg))
gostartcallfn(&newg.sched, fn)
newg.gopc = callerpc
newg.ancestors = saveAncestors(callergp)
newg.startpc = fn.fn
if isSystemGoroutine(newg, false) {
atomic.Xadd(&sched.ngsys, +1)
} else {
// Only user goroutines inherit pprof labels.
if _g_.m.curg != nil {
newg.labels = _g_.m.curg.labels
}
}
// Track initial transition?
newg.trackingSeq = uint8(fastrand())
if newg.trackingSeq%gTrackingPeriod == 0 {
newg.tracking = true
}
casgstatus(newg, _Gdead, _Grunnable)
gcController.addScannableStack(_p_, int64(newg.stack.hi-newg.stack.lo))
if _p_.goidcache == _p_.goidcacheend {
// Sched.goidgen is the last allocated id,
// this batch must be [sched.goidgen+1, sched.goidgen+GoidCacheBatch].
// At startup sched.goidgen=0, so main goroutine receives goid=1.
_p_.goidcache = atomic.Xadd64(&sched.goidgen, _GoidCacheBatch)
_p_.goidcache -= _GoidCacheBatch - 1
_p_.goidcacheend = _p_.goidcache + _GoidCacheBatch
}
newg.goid = int64(_p_.goidcache)
_p_.goidcache++
if raceenabled {
newg.racectx = racegostart(callerpc)
}
if trace.enabled {
traceGoCreate(newg, newg.startpc)
}
releasem(_g_.m)
return newg
}
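To connect this back to everyday code: every go statement is compiled into a call to newproc, which builds a _Grunnable g as shown above and puts it on the current P's local run queue.

go
package main

import "fmt"

func main() {
	done := make(chan struct{})
	go func() { // lowered by the compiler to: CALL runtime·newproc(SB)
		fmt.Println("hello from a g created by newproc")
		close(done)
	}()
	<-done // wait so main does not exit before the new g runs
}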
runtime·mstart
go
// mstart is the entry-point for new Ms.
// It is written in assembly, uses ABI0, is marked TOPFRAME, and calls mstart0.
func mstart()
// mstart0 is the Go entry-point for new Ms.
// This must not split the stack because we may not even have stack
// bounds set up yet.
//
// May run during STW (because it doesn't have a P yet), so write
// barriers are not allowed.
//
//go:nosplit
//go:nowritebarrierrec
func mstart0() {
_g_ := getg()
osStack := _g_.stack.lo == 0
if osStack {
// Initialize stack bounds from system stack.
// Cgo may have left stack size in stack.hi.
// minit may update the stack bounds.
//
// Note: these bounds may not be very accurate.
// We set hi to &size, but there are things above
// it. The 1024 is supposed to compensate this,
// but is somewhat arbitrary.
size := _g_.stack.hi
if size == 0 {
size = 8192 * sys.StackGuardMultiplier
}
_g_.stack.hi = uintptr(noescape(unsafe.Pointer(&size)))
_g_.stack.lo = _g_.stack.hi - size + 1024
}
// Initialize stack guard so that we can start calling regular
// Go code.
_g_.stackguard0 = _g_.stack.lo + _StackGuard
// This is the g0, so we can also call go:systemstack
// functions, which check stackguard1.
_g_.stackguard1 = _g_.stackguard0
mstart1()
// Exit this thread.
if mStackIsSystemAllocated() {
// Windows, Solaris, illumos, Darwin, AIX and Plan 9 always system-allocate
// the stack, but put it in _g_.stack before mstart,
// so the logic above hasn't set osStack yet.
osStack = true
}
mexit(osStack)
}
// The go:noinline is to guarantee the getcallerpc/getcallersp below are safe,
// so that we can set up g0.sched to return to the call of mstart1 above.
//go:noinline
func mstart1() {
_g_ := getg()
if _g_ != _g_.m.g0 {
throw("bad runtime·mstart")
}
// Set up m.g0.sched as a label returning to just
// after the mstart1 call in mstart0 above, for use by goexit0 and mcall.
// We're never coming back to mstart1 after we call schedule,
// so other calls can reuse the current frame.
// And goexit0 does a gogo that needs to return from mstart1
// and let mstart0 exit the thread.
_g_.sched.g = guintptr(unsafe.Pointer(_g_))
_g_.sched.pc = getcallerpc()
_g_.sched.sp = getcallersp()
asminit()
minit()
// Install signal handlers; after minit so that minit can
// prepare the thread to be able to handle the signals.
if _g_.m == &m0 {
mstartm0()
}
if fn := _g_.m.mstartfn; fn != nil {
fn()
}
if _g_.m != &m0 {
acquirep(_g_.m.nextp.ptr())
_g_.m.nextp = 0
}
schedule()
}
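After mstart1 finishes its setup it calls schedule, which never returns: the M now loops forever picking runnable goroutines. Below is a heavily simplified, purely illustrative sketch of that loop; the queue types and helper names are stand-ins, and the real scheduler also handles preemption, timers, netpoll and work stealing.

go
package main

import "fmt"

type gStub struct{ id int }

// Stand-ins for p.runq and sched.runq.
var localRunq []*gStub
var globalRunq []*gStub

func runqGet(q *[]*gStub) *gStub {
	if len(*q) == 0 {
		return nil
	}
	gp := (*q)[0]
	*q = (*q)[1:]
	return gp
}

// scheduleSketch mimics the shape of runtime.schedule: find a runnable
// g, then execute it (the real runtime does gogo(&gp.sched) to switch
// off g0 onto gp's stack).
func scheduleSketch() {
	for {
		gp := runqGet(&localRunq) // 1. the current P's local run queue
		if gp == nil {
			gp = runqGet(&globalRunq) // 2. the global run queue
		}
		if gp == nil {
			return // the real runtime would steal work or park the M
		}
		fmt.Println("executing g", gp.id) // stands in for execute(gp)
	}
}

func main() {
	localRunq = []*gStub{{id: 1}, {id: 2}}
	globalRunq = []*gStub{{id: 3}}
	scheduleSketch()
}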
The Role of g0

In Go, g0 is a special goroutine that primarily serves as the stack on which the scheduler runs its scheduling loop. For any thread (M), g0 is always the first goroutine it owns.

g0's responsibilities include:

- Scheduling: g0 hosts the scheduling loop and is where the scheduler picks the next goroutine to run.
- System calls and blocking operations: when the current goroutine enters a system call or a blocking operation, the associated bookkeeping runs on g0, and the original goroutine is resumed afterwards.
- Garbage collection: g0 takes part in GC work such as marking and scanning.
- Stack growth: when a goroutine needs to grow its stack, the copying work runs on g0.
- Other special tasks: for example creating new goroutines (newproc runs on the system stack, as shown above) and handling the defer machinery.

In short, g0 plays a central role in Go: it provides the stack on which the other goroutines are scheduled, and it carries the system-call bookkeeping, garbage-collection, and stack-growth work.
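A user program never touches g0 directly, but its effect is visible whenever the scheduler runs. For instance, runtime.Gosched parks the current goroutine and re-enters the scheduler (on g0, via mcall), which then picks the next runnable g:

go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	done := make(chan struct{})
	go func() {
		fmt.Println("picked by the scheduler running on g0")
		close(done)
	}()
	runtime.Gosched() // yield: switch to g0 and let it schedule another g
	<-done
	fmt.Println("main resumed")
}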
How the g0s Differ

In Go, the g0 that belongs to m0 (the thread that runs the bootstrap and eventually schedules runtime.main) and the g0 created for every later M play the same role, but they are set up differently. Note also that g0 is not the main goroutine: the main goroutine is an ordinary g created by newproc to run runtime.main.

- The main goroutine is the first user-visible goroutine created after bootstrap. The runtime creates it on m0 (via the newproc call in stage three), puts it on a P's local run queue, and once scheduled it executes runtime.main, which in turn calls the user's main function.
- Every M owns a g0, created together with the M, whose job is scheduling, garbage-collection helpers, and similar runtime work. m0's g0 is special: its stack is carved out of the main OS thread's stack (roughly 64KB, as the stage-one assembly shows). The g0 of every other M instead gets a runtime-allocated stack, 8KB by default.

In short: m0's g0 is wired up by hand in assembly, specifically for bootstrapping the program, while the g0 of each later M is created dynamically along with that M. They differ in how they are created and where their stacks live, but serve the same scheduling role.
Are Goroutine Stacks Allocated on the Heap or the Stack?

In Go, goroutine stacks are allocated dynamically on the heap. Each goroutine starts with a small heap-allocated stack, and that stack grows or shrinks as needed.

- Concretely, the runtime keeps two important global variables, runtime.stackpool and runtime.stackLarge: the global small-stack cache and the large-stack cache. The former serves stacks smaller than 32KB, the latter stacks of 32KB and above. When a goroutine is created, it gets a small initial stack from the heap. As its call chain deepens or its local variables need more space, the runtime goes through runtime.morestack and runtime.newstack to allocate a larger stack and copies the old frames into it (since Go 1.3, goroutine stacks are contiguous rather than linked segments).
- Note that frequent stack allocation and deallocation can be very expensive. Under the old segmented-stack scheme, a function call inside a tight loop that happened to straddle a segment boundary would allocate and free a segment on every iteration (the "hot split" problem). To mitigate this, Go 1.2 raised the minimum stack size from 4KB to 8KB; after the switch to the contiguous-stack strategy it was reduced back to 2KB.
- At the same time, Go's memory management pairs allocation with escape analysis and garbage collection, freeing developers from manual memory management and letting them spend more of their attention on software design.
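The growth path is easy to trigger: deep recursion overflows the current stack's guard, and the morestack/newstack path copies the frames onto a larger heap-allocated stack. A small demonstration (the recursion depth and padding size are arbitrary choices, only there to force growth):

go
package main

import "fmt"

// grow forces repeated stack growth: each frame carries some padding,
// so a deep recursion quickly exceeds the initial 2KB stack.
func grow(n int) int {
	var pad [128]byte // enlarge the frame so growth happens sooner
	pad[0] = byte(n)  // touch the padding so it is not optimized away
	if n == 0 {
		return int(pad[0])
	}
	return grow(n-1) + 1
}

func main() {
	// Runs fine even though 100000 frames vastly exceed 2KB: the runtime
	// transparently reallocates and copies the stack as it grows.
	fmt.Println(grow(100000))
}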