ribasushi  6 hours ago
we are still trying to figure out why this is happening

ribasushi  6 hours ago
but we would like a couple folks to test a particular revert

ribasushi  6 hours ago
here are the steps:
DO NOT DO THIS - IT DOES NOT HELP
git fetch
git checkout ntwk-calibration-8.13.1
git submodule update --init --recursive
bash -c 'cd extern/filecoin-ffi; git clean -fxd; git reset --hard HEAD'
patch -p0 <<"EOP"
diff --git a/rust/rustc-target-features-optimized.json b/rust/rustc-target-features-optimized.json
index 3d77062..1092106 100644
--- ./extern/filecoin-ffi/rust/rustc-target-features-optimized.json
+++ ./extern/filecoin-ffi/rust/rustc-target-features-optimized.json
@@ -7,3 +7,3 @@
     "rustc_target_feature": "+sha",
-    "check_cpu_for_feature": "sha"
+    "check_cpu_for_feature": null
   },
EOP
RUSTFLAGS="-C target-cpu=native -g" FFI_BUILD_FROM_SOURCE=1 make all 2>&1 | tee full_build_$(date '+%s').log


ribasushi  5 hours ago
@kenny, @s0nik42, @hyunmoon, @joe let me know if this works

s0nik42  5 hours ago
patched, it’s compiling

kenny  5 hours ago
damn it's taking a while

hyunmoon  5 hours ago
@ribasushi Unfortunately it didn't work.
lotus version 0.4.4+git.4c5e9614.dirty
2020-08-14T04:26:48.392+0900	INFO	chain	chain/sync.go:1263	Got blocks: 338 336
SIGILL: illegal instruction
PC=0x1f11383 m=24 sigcode=2
goroutine 0 [idle]:
runtime: unknown pc 0x1f11383
stack: frame={sp:0x7f03b65e03c8, fp:0x0} stack=[0x7f03b5de0f08,0x7f03b65e0b08)
...
(lots of errors...)
...
goroutine 4569 [select]:
github.com/ipfs/go-graphsync/responsemanager/peerresponsemanager.(*peerResponseSender).run(0xc013462320)
	/root/go/pkg/mod/github.com/ipfs/go-graphsync@v0.1.1/responsemanager/peerresponsemanager/peerresponsesender.go:380 +0xce
created by github.com/ipfs/go-graphsync/responsemanager/peerresponsemanager.(*peerResponseSender).Startup
	/root/go/pkg/mod/github.com/ipfs/go-graphsync@v0.1.1/responsemanager/peerresponsemanager/peerresponsesender.go:103 +0x3f
rax    0x7f03b65e0510
rbx    0x0
rcx    0x0
rdx    0x7f03b65e0530
rdi    0x7f03b65e0530
rsi    0x7f03b65e04f0
rbp    0x7f03b65e0590
rsp    0x7f03b65e03c8
r8     0x8553f9caa90e05c2
r9     0x7f03b65e0688
r10    0x89a391ba8fa08eb5
r11    0x25e08779c5701c77
r12    0x7f03b65e0530
r13    0x7f03b65e04f0
r14    0x7f03b65e04e0
r15    0x7f03b65e04c0
rip    0x1f11383
rflags 0x10206
cs     0x33
fs     0x0
gs     0x0
^C
[1]+  Exit 2

s0nik42  5 hours ago
:slightly_smiling_face:

ribasushi  5 hours ago
fml, this may not be ffi-related at all...
does any of you have the ability to figure out the exact instruction that is failing? e.g. by grabbing a core dump
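One possible way to answer this, sketched under assumptions: the crash report already prints the faulting program counter (PC=0x1f11383), so gdb can disassemble the binary at that address even without a core file. The helper name disas_at_pc is hypothetical, the binary path is the one reported later in this thread, and this only works as written if the address is not relocated at load time (i.e. the binary is not position-independent).

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the instructions at a given address in a binary.
# "info symbol" and "x/4i" are standard gdb commands; -batch exits when done.
disas_at_pc() {
  local bin=$1 pc=$2
  if ! command -v gdb >/dev/null 2>&1; then
    echo "gdb not found" >&2
    return 127
  fi
  if [ ! -r "$bin" ]; then
    echo "cannot read binary: $bin" >&2
    return 1
  fi
  # Print the enclosing symbol (if any), then 4 instructions starting at $pc.
  gdb -batch -ex "info symbol $pc" -ex "x/4i $pc" "$bin"
}

# Example (values taken from the crash report in this thread):
#   disas_at_pc /usr/local/bin/lotus 0x1f11383
```

The first instruction printed would be the one the CPU rejected, which can then be compared against the features the machine reports.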

kenny  5 hours ago
yep, it fails after it starts the libp2p swarm

kenny  5 hours ago
so the daemon starts but it panics after 3-4 seconds

hyunmoon  5 hours ago
@ribasushi I think core dumps are disabled by default. I'll look into it later. FYI, both my Intel servers have 2 sockets and I'm using all of them. I wonder if others have a similar setup.

kenny  5 hours ago
yep 2 sockets


s0nik42  5 hours ago
I’ve got just one

kenny  5 hours ago
:face_palm:

ribasushi  5 hours ago
ok new plan

s0nik42  5 hours ago
by socket you mean physical CPU, correct?

ribasushi  5 hours ago
everyone give me their lscpu output (please attach it as a code or text snippet, do not paste it inline here)
hopefully I can match at least one of them
and reproduce it myself
then we can fix it
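Since the patch under test concerns the sha CPU feature check, a quick look at the CPU flags alongside the lscpu output may help when comparing machines. A minimal sketch, assuming Linux: has_cpu_flag is a hypothetical helper, and sha_ni is the flag name the Linux kernel uses in /proc/cpuinfo for the SHA extensions. Whether that flag is actually related to this crash is not established in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the first "flags" line of a cpuinfo file
# lists exactly the given flag (whole-word match, so "sha" != "sha_ni").
has_cpu_flag() {
  local flag=$1 cpuinfo=${2:-/proc/cpuinfo}
  grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -qxF -- "$flag"
}

# Usage:
#   lscpu > "lscpu-$(hostname).txt"   # full output, to share as a snippet
#   if has_cpu_flag sha_ni; then echo "sha_ni: present"; else echo "sha_ni: absent"; fi
```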

s0nik42  5 hours ago
lscpu-s0nik42.txt 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3



nemo  5 hours ago
@hyunmoon Try ulimit -c unlimited and then re-run to capture the core dump?

hyunmoon  5 hours ago
lscpu-hyunmoon-xeongold.txt 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79



s0nik42  5 hours ago
@ribasushi, I got this when it crashed, does it help?
2020-08-13T21:38:05.797+0200	INFO	hello	hello/hello.go:117	Got new tipset through Hello: [bafy2bzacecnf5qpgvkma2urcd4dqmguoiu5m4idxupoqa2um63xfzkqxevqc6 bafy2bzacebzxppgauxycxe5ocyjdk3leiqbjb2k7qyxvkmbepgoo64ql5v2zg bafy2bzacedaecj3sfn5wpzm2hll7o7za2h3mnjbdo4isbe37gpxynszzrhjyy] from 12D3KooWF1Z1PRX7bqVFXw7ju8cZCnqfFUbxGoU7kw6UStXAEUpT
SIGILL: illegal instruction
PC=0x1f11383 m=11 sigcode=2
goroutine 0 [idle]:
runtime: unknown pc 0x1f11383
stack: frame={sp:0x7fb7e57f93c8, fp:0x0} stack=[0x7fb7e4ff9f08,0x7fb7e57f9b08)

hyunmoon  5 hours ago
I followed the quick guide here: https://askubuntu.com/questions/966407/where-do-i-find-the-core-dump-in-ubuntu-16-04lts and I did run ulimit -c unlimited. I'm not getting anything in /var/crash, so I don't think I'm doing this right.

nemo  5 hours ago
Check the current dir? Or where the binary lives?

hyunmoon  5 hours ago
@nemo .pyc files?

nemo  5 hours ago
No, it should just be called core

hyunmoon  5 hours ago
Nope. Can't find it anywhere. If you could give me some instructions on how to do it properly I'll do it. @nemo

nemo  5 hours ago
Sorry, I'm not familiar with Ubuntu, but most of the time it's in the current dir where the program is run from

s0nik42  5 hours ago
the core file is 440MB big

nemo  5 hours ago
Yes, can you run gdb core and get the output of where?

ribasushi  5 hours ago
@s0nik42 it should be very compressible, or just do what @nemo says :slightly_smiling_face:

s0nik42  5 hours ago
gdb core doesn’t work

s0nik42  5 hours ago
"/root/core": not in executable format: File format not recognised

s0nik42  5 hours ago
file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'lotus daemon', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/local/bin/lotus', platform: 'x86_64'

nemo  5 hours ago
Oh right, gdb ./lotus core

nemo  5 hours ago
or rather gdb /usr/local/bin/lotus core in that case

s0nik42  5 hours ago
root@keats:~# gdb /usr/local/bin/lotus core
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/bin/lotus...done.
[New LWP 32249]
[New LWP 32241]
[New LWP 32240]
[New LWP 32248]
[New LWP 32242]
[New LWP 32244]
[New LWP 32245]
[New LWP 32243]
[New LWP 32247]
[New LWP 32246]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `lotus daemon'.
Program terminated with signal SIGABRT, Aborted.
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:165
165		RET
[Current thread is 1 (Thread 0x7f4e69ffb700 (LWP 32249))]
warning: File "/usr/local/go/src/runtime/runtime-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file, add
	add-auto-load-safe-path /usr/local/go/src/runtime/runtime-gdb.py
line to your configuration file "/root/.gdbinit".
To completely disable this security protection, add
	set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection, see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"

nemo  5 hours ago
What about the where output?

s0nik42  5 hours ago
(gdb) where
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:165
#1  0x00000000005828cb in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:729
#2  0x0000000000582e7e in runtime.sigfwdgo (sig=6, info=0xc00039b4b0, ctx=0xc00039b380, ~r3=<optimised out>) at /usr/local/go/src/runtime/signal_unix.go:943
#3  0x00000000005817d4 in runtime.sigtrampgo (sig=6, info=0xc00039b4b0, ctx=0xc00039b380) at /usr/local/go/src/runtime/signal_unix.go:412
#4  0x00000000005a0943 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:389
#5  <signal handler called>
#6  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:165
#7  0x00000000005828cb in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:729
#8  0x0000000000582a5a in runtime.crash () at /usr/local/go/src/runtime/signal_unix.go:821
#9  0x0000000000581fed in runtime.sighandler (sig=3, info=0xc00039bbf0, ctxt=0xc00039bac0, gp=0xc000102d80) at /usr/local/go/src/runtime/signal_unix.go:652
#10 0x0000000000581939 in runtime.sigtrampgo (sig=3, info=0xc00039bbf0, ctx=0xc00039bac0) at /usr/local/go/src/runtime/signal_unix.go:452
#11 0x00000000005a0943 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:389
#12 <signal handler called>
#13 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:567
#14 0x0000000000567ea6 in runtime.futexsleep (addr=0xc0003632c8, val=0, ns=-1) at /usr/local/go/src/runtime/os_linux.go:45
#15 0x0000000000541e1f in runtime.notesleep (n=0xc0003632c8) at /usr/local/go/src/runtime/lock_futex.go:151
#16 0x0000000000571cc0 in runtime.stopm () at /usr/local/go/src/runtime/proc.go:1834
#17 0x0000000000572704 in runtime.gcstopm () at /usr/local/go/src/runtime/proc.go:2034
#18 0x0000000000573fb7 in runtime.schedule () at /usr/local/go/src/runtime/proc.go:2475
#19 0x0000000000575b36 in runtime.exitsyscall0 (gp=0xc01098fb00) at /usr/local/go/src/runtime/proc.go:3269
#20 0x000000000059c90b in runtime.mcall () at /usr/local/go/src/runtime/asm_amd64.s:318
#21 0x0000000000000000 in ?? ()

s0nik42  5 hours ago
@hyunmoon to generate the core file you should set GOTRACEBACK=crash before starting the daemon
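Putting the two suggestions from this thread together (ulimit -c unlimited and GOTRACEBACK=crash), a minimal sketch for Linux follows. The function names are hypothetical. It also inspects /proc/sys/kernel/core_pattern, because when that pattern starts with a pipe character the kernel hands the core to a helper program instead of writing a file named core in the working directory, which may explain why a core file cannot be found.

```shell
#!/usr/bin/env bash
# Hypothetical helper: explain where a given core_pattern value sends cores.
describe_core_target() {
  case "$1" in
    "|"*) echo "piped to handler: ${1#|}" ;;
    /*)   echo "written to absolute path pattern: $1" ;;
    *)    echo "written to the working directory as: $1" ;;
  esac
}

# Hypothetical helper: run a command with core dumps enabled and with the
# Go runtime asked to abort (and so dump core) on a fatal signal.
run_with_core() {
  (
    ulimit -c unlimited 2>/dev/null || echo "warning: could not raise core size limit" >&2
    GOTRACEBACK=crash "$@"
  )
}

# Usage:
#   describe_core_target "$(cat /proc/sys/kernel/core_pattern)"
#   run_with_core lotus daemon
```

The subshell keeps the raised limit and the environment variable scoped to that one run.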

nemo  5 hours ago
@s0nik42 And this is on Intel (just double checking)?

s0nik42  5 hours ago
Vendor ID:           GenuineIntel
CPU family:          6
Model:               61
Model name:          Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz

ribasushi  5 hours ago
@s0nik42 and one more double check - this very box did run the previous version of lotus flawlessly?

s0nik42  5 hours ago
I cannot tell :slightly_smiling_face: I started it this morning, but it was syncing the chain on the calibration network

s0nik42  5 hours ago
so I guess yes

s0nik42  5 hours ago
do you want me to do something else ?

kenny  5 hours ago
SIGQUIT: quit
PC=0x5a0ba1 m=21 sigcode=0
goroutine 0 [idle]:
runtime.futex(0xc0003f79c8, 0x80, 0x0, 0x0, 0x0, 0x7ff4165ea910, 0x5a07c5, 0x594e, 0x7ff4165ea940, 0x541e1f, ...)
	/usr/local/go/src/runtime/sys_linux_amd64.s:567 +0x21 fp=0x7ff4165ea8c8 sp=0x7ff4165ea8c0 pc=0x5a0ba1
runtime.futexsleep(0xc0003f79c8, 0x0, 0xffffffffffffffff)
	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7ff4165ea918 sp=0x7ff4165ea8c8 pc=0x567ea6
runtime.notesleep(0xc0003f79c8)
	/usr/local/go/src/runtime/lock_futex.go:151 +0x9f fp=0x7ff4165ea950 sp=0x7ff4165ea918 pc=0x541e1f
runtime.stopm()
	/usr/local/go/src/runtime/proc.go:1834 +0xc0 fp=0x7ff4165ea978 sp=0x7ff4165ea950 pc=0x571cc0
runtime.findrunnable(0xc00004e800, 0x0)
	/usr/local/go/src/runtime/proc.go:2366 +0xa0d fp=0x7ff4165eaa60 sp=0x7ff4165ea978 pc=0x5732dd
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2526 +0x2fc fp=0x7ff4165eaac8 sp=0x7ff4165eaa60 pc=0x573e1c
runtime.park_m(0xc011f16d80)
	/usr/local/go/src/runtime/proc.go:2696 +0x9d fp=0x7ff4165eaaf8 sp=0x7ff4165eaac8 pc=0x57439d
runtime.mcall(0x800000)
	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7ff4165eab08 sp=0x7ff4165eaaf8 pc=0x59c90b
rax    0xca
rbx    0xc0003f7880
rcx    0x5a0ba3
rdx    0x0
rdi    0xc0003f79c8
rsi    0x80
rbp    0x7ff4165ea908
rsp    0x7ff4165ea8c0
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x3
r13    0xc0065cc480
r14    0x2e
r15    0x55
rip    0x5a0ba1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
Aborted (core dumped)

kenny  5 hours ago
it worked flawlessly for me before this last reset

kenny  5 hours ago
it looks like the same thing

nemo  5 hours ago
@s0nik42 Not just yet, looking at it a bit

joe  5 hours ago
I used make clean and make all to compile directly, without errors. I think this information may be a useful reference @nemo

s0nik42  5 hours ago
I will be around until the top of the hour, let me know

David Zhang  5 hours ago
@joe on which branch? next?

joe  4 hours ago
ntwk-calibration-8.13.1

nemo  4 hours ago
@joe But the issue is runtime, no?

s0nik42  4 hours ago
it is

joe  4 hours ago
Yes, when compiled without adding parameters, the program runs normally @nemo

nemo  4 hours ago
@joe So you just built and ran that (on Intel) and everything worked as expected? I'm confused about what you mean here

ribasushi  4 hours ago
@joe I am also confused - can you clarify what worked?

kenny  4 hours ago
same for me, I've removed the build env variables

kenny  4 hours ago
and it works with a simple make all

kenny  4 hours ago
I'm running the daemon without problems now

kenny  4 hours ago
might not be optimised but it doesn't crash

s0nik42  4 hours ago
without the patch either, @kenny?

ribasushi  4 hours ago
ok so if you guys do not recompile the ffi
but instead download the precompiled version we publish

ribasushi  4 hours ago
it just works?

kenny  4 hours ago
yes

kenny  4 hours ago
without anything

ribasushi  4 hours ago
ok this is a helpful datapoint too!
I am done with my meetings now, will dig into reproducing this locally for the ffi team


s0nik42  4 hours ago
@ribasushi I’m sorry, I'm missing something: how do we download the precompiled one you published?

s0nik42  4 hours ago
My process is still crashing :confused:

kenny  4 hours ago
just do make all without adding the env vars for building from source

s0nik42  4 hours ago
@kenny, shall I : git check also

s0nik42  4 hours ago
?

s0nik42  4 hours ago
fetch

ribasushi  4 hours ago
one moment, I will write up instructions

kenny  4 hours ago
I didnt do anything special

s0nik42  4 hours ago
@ribasushi yes please , thank you :slightly_smiling_face:

ribasushi  4 hours ago
git fetch
git checkout ntwk-calibration-8.13.1
git submodule update --init --recursive
bash -c 'cd extern/filecoin-ffi; git clean -fxd; git reset --hard HEAD'
RUSTFLAGS="" FFI_BUILD_FROM_SOURCE="" make all 2>&1 | tee full_build_$(date '+%s').log

nemo  4 hours ago
@ribasushi This is how it's supposed to work though, right?  End users don't usually rebuild ffi do they?

s0nik42  4 hours ago
it works

s0nik42  4 hours ago
the chain is now fully synced

ribasushi  4 hours ago
@nemo we recommend that users rebuild in most docs :disappointed:

ribasushi  4 hours ago
for extra oomph

nemo  4 hours ago
Ok, that's news to me. I think it was designed that way (i.e. to not rebuild) for a reason, but I don't know the history. laser explained to me that this was the recommended way, except for development.

s0nik42  4 hours ago
@ribasushi @nemo but that’s the default behaviour of "make all", isn’t it?

ribasushi  4 hours ago
@s0nik42 shouldn't be, you probably have stuff in your env from earlier setups

nemo  4 hours ago
Not unless the env var is set as far as I know

s0nik42  4 hours ago
which var  ?

nemo  4 hours ago
FFI_BUILD_FROM_SOURCE=1

s0nik42  4 hours ago
ah yes !

s0nik42  4 hours ago
I do that, I found it somewhere but cannot recall where

s0nik42  4 hours ago
so I will remove it from my script thank you
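To check whether a shell or build script still carries the variables discussed above, something like the following could be run before make all. A minimal sketch for bash: check_ffi_env is a hypothetical helper name, and the two variable names are the ones used in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report leftover build variables in the environment.
# Prints each non-empty variable and fails only when FFI_BUILD_FROM_SOURCE
# is non-empty, since per this thread that is the one that triggers the
# source rebuild of the ffi.
check_ffi_env() {
  local var
  for var in FFI_BUILD_FROM_SOURCE RUSTFLAGS; do
    if [ -n "${!var:-}" ]; then
      echo "$var is set to: ${!var}"
    fi
  done
  [ -z "${FFI_BUILD_FROM_SOURCE:-}" ]
}

# Usage:
#   check_ffi_env || echo "make all would rebuild the ffi from source"
```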


ribasushi  4 hours ago
in any case - this worked before, we will look at fixing it, but likely won't happen today

s0nik42  4 hours ago
also, I’m using : -C target-cpu=native -g

s0nik42  4 hours ago
shall I keep these ?

s0nik42  4 hours ago
RUSTFLAGS="-C target-cpu=native -g"

ribasushi  4 hours ago
it won't apply to anything without the other var, it doesn't matter

nemo  4 hours ago
Yes, keep those

s0nik42  4 hours ago
ok thank you very much, have a nice day guys, and thanks again for the hard work

hyunmoon  3 hours ago
@nemo @s0nik42 Oh shoot. I've been doing that the whole time as well! env FFI_BUILD_FROM_SOURCE=1 RUSTFLAGS="-C target-cpu=native -g" make clean deps all install

hyunmoon  3 hours ago
I'm gonna go ahead and try without FFI_BUILD_FROM_SOURCE right now.

hyunmoon  3 hours ago
It worked!

ribasushi  3 hours ago
note - it should work the other way too, we just have not had a chance to figure out why it stopped working


BenjaminH  3 hours ago
I just compiled it from source on an Intel Xeon node, and it runs the lotus-worker without any issues... so far. I’m running it dedicated to the commit phase. I will let you know if there are also problems there. I got the first batch in PC1. Great that you figured out lotus daemon :+1: