Peter Cai

@PeterCxy

Some random guy out there. en_US / zh_CN


How to Enable China Telecom VoLTE on the Nut Pro 2 (ALPHA)

2018.12.01 Update

  • Someone has packaged this method into a directly flashable ZIP and written a much simpler tutorial based on it; if you would rather not go through all of the steps below, see http://luo2888.xyz/post/25.html
  • Testing confirms that flashing official firmware updates after completing this China Telecom VoLTE hack does not undo it (last tested with the 2018-10-29 firmware). If you want to play it safe, back up the complete NV before every flash; I recommend exporting a QCN backup with Qualcomm's QPST tool.

Original post

For reasons everybody knows, Smartisan has yet to push the China Telecom VoLTE configuration files to the Nut Pro 2, even though China Telecom VoLTE officially entered trial commercial service on November 29, 2018. As a developer who bought the Nut Pro 2 as both a daily driver and a development device, going without VoLTE for this long is genuinely uncomfortable, especially since the hardware clearly should support it: the Redmi Note 5, built on the same platform (SDM660), received its China Telecom VoLTE modem configuration long ago. I had therefore long planned to port the Redmi Note 5's configuration files to the Nut Pro 2, but my local China Telecom branch kept delaying its VoLTE trial. Today it finally opened up VoLTE along with the rest of the country, and I could at last put the idea into practice.

Before reading on, note that this method has been verified on my two Nut Pro 2 units, but I do not know why it works (in particular, why the intermediate step of overwriting everything with Xiaomi's files is necessary...). It just works. By following this article, you agree to bear all possible consequences yourself.

First, you need a rooted Nut Pro 2 running a ROM with VoLTE support (among third-party ROMs, both MoKee 8.1 and the latest version of Nitrogen OS, which I maintain, qualify). You also need basic adb / command-line knowledge. Download the following file and extract it to the phone's storage for later use:

https://www.androidfilehost.com/?fid=11410963190603862029

The detailed procedure that worked on both of my phones is as follows. (Tested only on the 2018-09-04 modem firmware; after upgrading the firmware / modem you will need to repeat these steps!)

  1. Enter the phone's command line with adb shell, then run su to become root
  2. Run mount -o remount,rw /firmware
  3. Run cd /firmware/image/modem_pr/mcfg/configs to enter the modem configuration directory
  4. Copy the mcfg_sw directory in there to the SD card as a backup, in case something goes wrong and cannot otherwise be reverted (cp -r mcfg_sw /sdcard/some/path/)
  5. Delete mcfg_sw (rm -rf mcfg_sw)
  6. Copy mcfg_sw.xiaomi.flash_first from the files extracted earlier into this directory and rename it to mcfg_sw (cp -r /path/to/mcfg_sw.xiaomi.flash_first ./ && mv mcfg_sw.xiaomi.flash_first mcfg_sw)
  7. Delete the /data/vendor/radio directory (rm -rf /data/vendor/radio)
  8. Reboot the phone; once the signal icon appears, reboot again. After this boot, China Telecom 4G should NOT be usable (at least that was the case on both of my phones)
  9. Repeat step 3 to enter the modem configuration directory, repeat step 5 to delete mcfg_sw, then copy mcfg_sw2.working from the extracted files into the directory and rename it to mcfg_sw, as in step 6
  10. Reboot. Typing a number in the dialer should now show a video call option; if it appears, VoLTE has been enabled successfully. You can also dial *#*#4636#*#*, choose the first entry, Phone Information, and pick IMS Service Status from the top-right menu to check the IMS (i.e. VoLTE) registration status. (IMS Service Status only shows the status of SIM slot 1.)
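Steps 4-6 boil down to a backup-and-swap of a single directory. As a sketch (run as root on the phone after the remount in step 2; paths are the ones given above, and this helper is illustrative, not a tested flashing tool):

```shell
#!/bin/sh
# Sketch of steps 4-6: back up the stock mcfg_sw tree, then swap in a new one.
swap_mcfg() {
  configs="$1"   # e.g. /firmware/image/modem_pr/mcfg/configs
  newtree="$2"   # e.g. the extracted mcfg_sw.xiaomi.flash_first
  backup="$3"    # e.g. /sdcard/mcfg_backup
  cp -r "$configs/mcfg_sw" "$backup" || return 1  # step 4: keep a way back
  rm -rf "$configs/mcfg_sw"                       # step 5
  cp -r "$newtree" "$configs/mcfg_sw"             # step 6: new tree, old name
}
# usage (on-device paths assumed):
#   swap_mcfg /firmware/image/modem_pr/mcfg/configs \
#             /sdcard/mcfg_sw.xiaomi.flash_first /sdcard/mcfg_backup
```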

In the procedure above, mcfg_sw.xiaomi.flash_first is the configuration directory I extracted from the same path (/firmware/image/modem_pr/mcfg/configs) on a friend's Redmi Note 5, while mcfg_sw2.working was made by overwriting the generic/china/ct directory from the Redmi Note 5 configuration onto both generic/china/ct and generic/smartisa/ca/ct in the Nut Pro 2 configuration, and then modifying hvolte_ovolte_op in oem_sw.txt.
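As a hypothetical sketch of that assembly (directory layout assumed from the description above; the real oem_sw.txt edit was done by hand):

```shell
#!/bin/sh
# Hypothetical reconstruction of how mcfg_sw2.working was assembled.
# $1: mcfg_sw tree from the Redmi Note 5; $2: stock Nut Pro 2 mcfg_sw backup;
# $3: output tree.
build_ct_overlay() {
  xiaomi="$1"; smartisan="$2"; out="$3"
  cp -r "$smartisan" "$out"
  # Overwrite both China Telecom profile directories with Xiaomi's version
  cp -r "$xiaomi"/generic/china/ct/. "$out"/generic/china/ct/
  cp -r "$xiaomi"/generic/china/ct/. "$out"/generic/smartisa/ca/ct/
  # hvolte_ovolte_op in oem_sw.txt still has to be edited by hand afterwards
}
```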

I originally assumed that simply dropping the China Telecom configuration into the Nut Pro 2's stock configuration would be enough, but after it worked on my backup phone, the same approach failed on my daily driver. Only then did I remember that, while experimenting on the backup phone, I had at one point, out of desperation, overwritten Smartisan's entire configuration directory with Xiaomi's. That, of course, broke everything; but after doing so and then restoring Smartisan's stock configuration with only Xiaomi's China Telecom files kept, VoLTE miraculously started working. This is where the baffling step 6 above comes from.

I have been far too busy lately, and too eager to share this result, to investigate carefully why the inexplicable magic of step 6 is necessary. All I know is that doing it this way works. My current suspicion is that some file in the configuration directory writes specific parameters into EFS; alternatively, after replacing only the China Telecom configuration, it might suffice to activate the corresponding China Telecom VoLTE profile with the PDC tool from Qualcomm's QPST, making step 6 unnecessary. I do not have an answer yet.

It may even be possible to package a ready-made NON-HLOS.bin so that users would only need to flash it once to get China Telecom VoLTE. All of that, however, will have to wait until I have more spare time...

Nitrogen OS for Smartisan U3 Pro (osborn; Nut Pro 2)

This page is used to publish updates on my self-built unofficial NitrogenOS for Smartisan U3 Pro, codenamed "osborn", also known as "Nut Pro 2 (坚果 Pro 2)".


Before flashing

Be aware that this phone has NO official support for bootloader unlocking. There is only ONE way to flash custom ROMs on this phone: downgrading, via Qualcomm USB-DL (9008 mode), to a previous flawed bootloader that accepts all boot.img signatures.


This ROM may also require the latest modem image. You can find details online on how to flash this phone, mostly from a developer named XiNGRZ, whose code my ROM is based on.


Firmware / 底包

You MUST install the latest firmware ZIP package before installing the latest ROM. Check here regularly for firmware updates. DO NOT update to the latest firmware without also updating to the latest ROM, or vice versa.


Current Firmware Version: 20181029203642

Update logs

20181209-1

  • Hotfix: fixed boot problems on 256G version of the phone

20181208-1

This build has a bug on the 256G top-spec version that can crash the system; if you own that version, do not install this update.

  • Synchronized with Nitrogen OS upstream
  • Merged Linux 4.4.166

20181201-1

  • Synchronized with Nitrogen OS upstream
  • Merged Linux 4.4.165
  • Updated firmware from Smartisan OS. You may need to flash the new firmware from https://www.androidfilehost.com/?fid=11410963190603863206 before installing this update
  • Removed the dynamic navigation bar feature added last week due to extreme instability, a ton of bugs, and conflicts with the newly-introduced Nitrogen OS official navbar mods. I will add it back in the future when I can spend some time polishing it up.

20181124-1

  • Synchronized with Nitrogen OS upstream
  • Merged Linux 4.4.164
  • Added multi-weight Noto CJK fonts (which bloated the zip file)
  • An experimental feature: dynamic navigation bar tinting according to the status bar color, for those apps that only tint the status bar but not the navbar. Not ready for daily use yet, but you can try it in Settings -> Personalization -> Navigation bar. It is a real pain in the ass to debug a ROM when I can only build on a remote HDD server. I think I will finish this feature when I get my new AMD YES workstation built.

20181111-1

  • Synchronized with Nitrogen OS upstream
  • Merged Linux 4.4.163
  • Made an attempt to temporarily fix the Wi-Fi not turning on problem on some users' devices. If this works, please tell me and I'll attempt to fix it permanently

20181031-3

  • SELinux is now globally enforcing, though several privileged processes are still in permissive mode. If anything breaks after this update, please let me know immediately
  • Fixed system ANR on boot
  • Improved edge touch handling
  • Improved power efficiency

20181021-2

  • VoLTE is finally working, tested with CMCC.
  • Upgraded Linux kernel 4.4.162
  • Dirty-fixed compatibility with some third-party camera apps which had problems with recording. Tested with Google Camera.
  • Merged updates from Nitrogen OS

20181012-4

  • Synchronized VoLTE configuration (from 蛋丁的蛋香), but I can't test whether it works since no carrier available to me supports it
  • Fixed in-call audio and recording volume by bringing back Smartisan's proprietary driver
  • Rewrote SELinux policies, though it is still not ready for enforcing
  • Upgraded Linux kernel 4.4.160
  • Merged updates from Nitrogen OS

20181007-4

  • Dirty-fixed the hardware encoder problem. Now the camcorder should work perfectly.
  • Re-enabled 4K UHD recording.

20181007-1

  • Updated Qualcomm proprietary blobs
  • Temporarily disabled hardware encoder to fix camera recording (now the recording feature works just fine)
  • Temporarily disabled 4K UHD recording because software encoder doesn't support it
  • Fixed permissions for framebuffer devices

20181004-5

  • First release

Google Camera

I use Google Camera from here with version MGC_6.1.013_MiMAX2_V1b_A8.1+fix_Hexagon_failed+blFront.


Donation

Your generous support will provide great motivation for continued maintenance of my ROMs.


If you re-post my ROMs, please kindly keep the donation information.


Patreon: https://www.patreon.com/PeterCxy

Alipay / Wechat Pay (支付宝 / 微信):

Download Links

20181209-1

History versions


20181208-1

20181201-1

20181124-1

20181111-1

20181031-3

20181021-2

20181012-4

20181007-4

20181007-1

20181004-5

Other Links

How to enable VoLTE for China Telecom on this phone:


https://listed.standardnotes.org/@PeterCxy/3501/pro-2-volte-alpha

Source code

The local_manifests to synchronize my device tree into the Nitrogen source code is available at https://github.com/PeterCxy/local_manifests/tree/p.

All my devices and modified source codes are available at https://github.com/PeterCxy.

Wireguard with Network Namespace + BitTorrent / Shadowsocks / ...

Background

I have long been running a BT/PT download box on one of my dedicated servers. The reason is that I have an extremely poor uplink on my home broadband, and running any kind of P2P software simply kills the network. However, putting that software on a server without any protection is a bad idea -- it will happily announce your server IP everywhere, and, *cough*, some nasty things may happen to you, even just from downloading some pretty innocent files. I need at least some kind of protection to avoid leaking the real IP to the torrent world. Using a SOCKS5 proxy alone is not the best idea either: anything in the BT protocol, for example DHT, can easily leak the IP address if the BT client itself is not isolated in a way that prevents it from seeing the real IP.

The same goes for my personal proxy service. Residing in China, there is basically no way to connect directly to VPN services abroad, even ones that are not blocked -- ISPs here throttle UDP traffic aggressively, and TCP VPNs are unbearably slow and easily interrupted with RST. Normally we use self-hosted encrypted proxies instead of VPNs to bypass this, usually hosted on cheap VPSes such as Vultr. However, this way it is easy to leak the proxy IP (the VPS IP), because anyone can simply record the mapping between the source IP and the account holder. What I need is yet another layer of protection -- the outbound IP should be different from the server's own.

Unfortunately, enabling a VPN on a server is not as easy as doing it on your own computer. You can't simply set the default route, because by doing so, access to the server through its main IP will break, and you will be left locked out, lonely and helpless, outside of the server. Moreover, enabling the VPN alone is not enough at all, since the public IP is assigned to the primary network device, and it is fairly simple to fetch that address (and many programs will actually do this, announcing every possible IP to the public). Full network isolation is needed, but I did not want to introduce a complete container like Docker, which seemed way too excessive.

Network Namespace

Luckily, Linux has this implemented for us. The ip-netns(8) tool manages a cool feature of the Linux kernel, the network namespace, which is exactly what we need here. Full container implementations also leverage this feature to virtualize their network environment, but we are only using the network part, which is much more lightweight than full container virtualization.

A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.

So, all we have to do is find some way to put the VPN tunnel device in a network namespace and set the default route only in that namespace. Nothing but the VPN device and the single default route will be visible inside the namespace, which is pretty safe for most software not designed to intentionally escape from namespaces.

The Legacy of OpenVPN

Previously I was a user of ProtonVPN, which was a great VPN for my purpose (except that it has no IPv6 support at all; I was expecting VPNs to implement IPv6 NAT...). Since it used OpenVPN as its main VPN software, I used to make use of OpenVPN's up and down scripts to enable the VPN inside network namespaces.

Since OpenVPN is a pretty old and widely-adopted protocol, there are plenty of guides on how to achieve this with OpenVPN. What I used was a script here that moves the TUN interface into a network namespace managed by the script upon finalizing the connection. The script is pretty mature and works just fine.

However, ProtonVPN started breaking down recently. I have no idea why, but since some random day, ProtonVPN's endpoints started becoming null-routed at random. I am sure it is not blocked by the ISP, because I only run it on my VPS outside of China, and I genuinely cannot see routes to its IPs in my BGP sessions elsewhere. It just seems to go down without reason. Besides, OpenVPN is much too bloated and sometimes causes problems of its own. Since Linus Torvalds has said that Wireguard should be merged into the mainline Linux kernel soon, I started to look for an alternative solution based on Wireguard.

Attempt: Wireguard + wg-quick

After some searching I found a pretty good Wireguard VPN provider with both IPv4 and IPv6 NAT support. Wireguard is pretty easy to configure, since the provider will often give you something like this:

[Interface]
PrivateKey = blahblah
Address = 192.168.x.x/24, fe80::xxx/64
DNS = x.x.x.x

[Peer]
PublicKey = blahblah
AllowedIPs = 0.0.0.0/0,::0/0
Endpoint = x.x.x.x:xxxx

which is normally placed in /etc/wireguard/wireguard-config-name.conf. Such configuration is meant for the tool wg-quick(8). However, this tool doesn't support network namespaces out of the box. I made a naïve attempt like the one below:

ip netns add vpn
ip netns exec vpn ip link add dev wireguard-vpn type wireguard
ip netns exec vpn wg-quick up my-config-name

...and of course, it failed. Wireguard also obeys network namespace rules while establishing its underlying sockets, which is why this failed -- you can't connect to any VPN from a newly-created network namespace that has no routes. Resolving this by bridging the host network into the namespace didn't seem appealing to me, since it would be very complex to configure and could still leak the real IP.

The Real Solution

After some Google-fu, I found an official document of Wireguard that described an interesting property of the Wireguard driver: it "remembers" the network namespace where it was created.

it remembers the namespace in which it was created. "I was created in namespace A." Later, WireGuard can be moved to new namespaces ("I'm moving to namespace B."), but it will still remember that it originated in namespace A.

WireGuard uses a UDP socket for actually sending and receiving encrypted packets. This socket always lives in namespace A – the original birthplace namespace.

This is exactly what we were looking for! Since Wireguard sends its underlying UDP packets in the namespace where the device was created, not the one where the device currently lives, we can have a completely "clean" network namespace whose only default route is the Wireguard device, while Wireguard itself still connects through the original host network!

All we have to do now is: create the Wireguard interface, apply the configuration, move it into a newly-created network namespace, then set the IPs, routes, etc. We can no longer use wg-quick for this, because that tool is meant for quick configuration and would set up the routes for us in the main namespace (according to AllowedIPs). We have to use the lower-level command wg setconf instead. Note that we have to comment out the DNS and Address lines in the provided configuration if present, because wg setconf does not support setting DNS or IP addresses.
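Commenting out those wg-quick-only keys can be done mechanically. A small helper (a sketch; the path is whatever your provider's config is saved as):

```shell
#!/bin/sh
# Comment out the keys that wg-quick understands but `wg setconf` rejects.
# Produces lines like "#Address = ..." which can still be parsed later.
comment_wgquick_keys() {
  sed -i -E 's/^[[:space:]]*(Address|DNS)[[:space:]]*=/#\1 =/' "$1"
}
# usage: comment_wgquick_keys /etc/wireguard/wireguard-config-name.conf
```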

I tried a simple script following the above procedure:

#!/bin/bash
CONFIG_NAME="$1"
DEV_NAME="wg-$CONFIG_NAME"

ip netns add $CONFIG_NAME
ip netns exec $CONFIG_NAME ip link set lo up
ip link add dev $DEV_NAME type wireguard
wg setconf $DEV_NAME /etc/wireguard/$CONFIG_NAME.conf
ip link set $DEV_NAME netns $CONFIG_NAME up

Note that I have set the name of the namespace to be the same as the configuration file name. Run it with ./script.sh wireguard-config-name, and it successfully sets up the namespace with the Wireguard device in it. However, the IP addresses were not set, because we did not use wg-quick and we commented out the Address line in the configuration. At this point, I could have simply hard-coded the addresses in the script, but that did not sound like an elegant solution.

I did a not-so-elegant-but-better-than-nothing hack, which was to make use of the commented-out Address line: we could simply parse the line (ignoring the #) and extract the addresses from there!

addrs=$(grep -oP "#Address = \K(.*)" /etc/wireguard/$CONFIG_NAME.conf)
IFS=", "; for addr in $addrs; do
  if [[ $addr = *":"* ]]; then
    # IPv6
    ip netns exec $CONFIG_NAME ip -6 addr add $addr dev $DEV_NAME
  else
    # IPv4
    ip netns exec $CONFIG_NAME ip addr add $addr dev $DEV_NAME
  fi
done

Adding this to the previous script, we now have the IPs properly assigned to the Wireguard device. We could pretty much do the same with the routes, extracting them from AllowedIPs, but I decided it was better to just set the default routes for both IPv4 and IPv6:

ip netns exec $CONFIG_NAME ip route add default dev $DEV_NAME
ip netns exec $CONFIG_NAME ip -6 route add default dev $DEV_NAME

Now we are done with the script that sets the interface up. Tearing it down is much simpler:

#!/bin/bash
CONFIG_NAME="$1"

ip netns del $CONFIG_NAME

Running Systemd Services inside the Namespace

At this point, we can use ip netns exec to run programs inside the network namespace. However, I would like to run systemd services inside it. To fully leverage the abilities of systemd, I decided to first write a service to manage the Wireguard interface in the network namespace. Assuming that the up and down scripts described above are placed at /path/to/wg-up.sh and /path/to/wg-down.sh, I wrote a service named wg-netns@.service:

[Unit]
Description=Execute Wireguard in network namespace
Wants=network-online.target
After=network-online.target

[Service]
User=root
Type=oneshot
RemainAfterExit=true
ExecStart=/path/to/wg-up.sh %i
ExecStop=/path/to/wg-down.sh %i

[Install]
WantedBy=multi-user.target

Then enable it with systemctl enable wg-netns@wireguard-config-name. Now, we can use systemctl edit some-service to put some-service into the namespace by writing

[Unit]
Requires=wg-netns@wireguard-config-name.service
After=wg-netns@wireguard-config-name.service

[Service]
User=
User=root
ExecStart=
ExecStart=/usr/bin/ip netns exec wireguard-config-name /path/to/the/program

in the editor provided by systemctl edit. Note that this configuration is very generic, and you may need to consult the original service file for the complete command to put in place of /path/to/the/program. Besides, with such a configuration you are also running the program as root, which can be a security concern and may make some programs behave abnormally. You may need to insert sudo -u <user> (or runuser -u <user> --) after ip netns exec wireguard-config-name and before the actual command, to switch to the proper user for your program.

Now you can enable the service as normal. Services configured like this will only start when wg-netns@wireguard-config-name is started, and will restart or stop if wg-netns@wireguard-config-name is restarted or stopped.

One More Thing: Exposing Ports within the Namespace

All the configuration above is perfectly fine if we do not need any service running in the namespace to be accessible from the outside. But for the BT client and the Shadowsocks server, we must at least be able to reach their listening TCP ports in order to control / use them while retaining the isolation. My solution was to set up a separate veth pair and assign the namespace a separate internal IP address without NAT, so that I can access the ports via the internal IP, or forward them to the outside, while still forbidding the services themselves from breaking the isolation.

This step is much simpler. We just create a pair of veth devices, put one of them into the namespace, then assign an IP address to each end.

ip link add dev "$CONFIG_NAME"0 type veth peer name "$CONFIG_NAME"1
ip link set "$CONFIG_NAME"0 up
ip link set "$CONFIG_NAME"1 netns $CONFIG_NAME up
ip addr add $PRIVATE_ADDRESS_HOST dev "$CONFIG_NAME"0
ip netns exec $CONFIG_NAME ip addr add $PRIVATE_ADDRESS_CLIENT dev "$CONFIG_NAME"1

...where PRIVATE_ADDRESS_HOST is the internal address to be assigned to the host end and PRIVATE_ADDRESS_CLIENT is the one for the namespace end; these are normally private addresses like 192.168.123.1/24 and 192.168.123.2/24. In the script, I actually wrote something like

source ${BASH_SOURCE%/*}/ext/$CONFIG_NAME.conf
if $PRIVATE_VETH_ENABLED; then
  ip link add dev "$CONFIG_NAME"0 type veth peer name "$CONFIG_NAME"1
  ip link set "$CONFIG_NAME"0 up
  ip link set "$CONFIG_NAME"1 netns $CONFIG_NAME up
  ip addr add $PRIVATE_ADDRESS_HOST dev "$CONFIG_NAME"0
  ip netns exec $CONFIG_NAME ip addr add $PRIVATE_ADDRESS_CLIENT dev "$CONFIG_NAME"1
fi

...so that you can have an ext/wireguard-config-name.conf (relative to the location of the up script, corresponding to /etc/wireguard/wireguard-config-name.conf) with additional variables describing the internal IPs, which are not related to Wireguard itself:

#!/bin/bash
PRIVATE_VETH_ENABLED=true
PRIVATE_ADDRESS_HOST="192.168.123.1/24"
PRIVATE_ADDRESS_CLIENT="192.168.123.2/24"

Correspondingly, the down script has to tear the veth pair down:

source ${BASH_SOURCE%/*}/ext/$CONFIG_NAME.conf

if $PRIVATE_VETH_ENABLED; then
  ip netns exec $CONFIG_NAME ip link del dev "$CONFIG_NAME"1
  ip link del dev "$CONFIG_NAME"0
fi

You can then set up port forwarding or anything else to this internal IP.
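For instance (purely illustrative: 192.168.123.2 comes from the sample ext config above, and 9091 is just a typical BT-client web UI port), a simple relay on the host can republish a port from the namespace. Because the connection then originates from the host's end of the veth pair, the reply goes back over the directly-connected veth route instead of the tunnel:

```shell
#!/bin/sh
# Hypothetical helper: republish a TCP port listening on the namespace's
# internal address onto the host, keeping the service itself isolated.
expose_port() {
  # $1 = host port to listen on, $2 = internal addr:port inside the namespace
  socat TCP-LISTEN:"$1",fork,reuseaddr TCP:"$2" &
}
# usage: expose_port 9091 192.168.123.2:9091   # e.g. a BT client's web UI
```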

Now you have a complete working setup of Wireguard inside a network namespace.

Source code

I have uploaded the source code of my completed setup to https://git.angry.im/PeterCxy/wg-netns.

Troubleshooting a mysterious Mastodon bug: the Accept-Encoding header and federation

The story

As you may all know, I am the administrator of a Mastodon instance, https://sn.angry.im. One really fun thing about this job (and every SysAdmin job) is that you run into different problems from time to time, sometimes out of nowhere and sometimes after an upgrade.

Last week, Mastodon v2.4.0 was out, and I, along with my friend, the admin of https://cap.moe, decided to upgrade to the new release as quickly as possible. Since there was nothing breaking in the new version, it didn't take long before we both finished executing a few Docker commands and restarted into the new version. As usual, we tried to post something to ensure that everything still worked after the upgrade, and this is where things started to break.

We first noticed that I could not see anyone from cap.moe on my home timeline, while he could see everyone from my instance on his. We thought this was a subscription problem, so we both ran a resubscription task in the administration panel of our Mastodon instances. However, this did not fix anything. We then tried to mention each other in a toot to find out whether it was a timeline logic error, but it was not. Still, he could see me, but I couldn't see anyone on his instance.

One interesting thing is that, since some other instances, for example pawoo.net, could see both of our instances' posts, I could simply retoot one of his toots on pawoo and receive it on my instance within seconds. I didn't know what this meant, but it was certainly intriguing.

Since other mysterious bugs had happened before and just magically fixed themselves after a while, I decided it was a good idea to leave it alone and see if things went back to normal. It is now a week after the initial upgrade, nothing has changed throughout the entire week, and I can't bear a Mastodon timeline without the jokes from the fakeDonaldTrump account on cap.moe to fill my spare time anymore. I finally decided to troubleshoot this "bug".

Attempts

My first idea was that it could be caused by some error in the task queue or in the database, both of which could be reset by applying an instance block and removing it after everything was cleared from my instance -- at least that was what I believed. This, obviously, was not the case. After removing the instance block, everything was still as it was before. Mastodon provides no support for really removing users anyway, at least in the database. As the admin of cap.moe put it:

This is completely suicide attack.

If you are an administrator, never attempt anything that works like a suicide attack, because it solves nothing and only adds complexity.

The only option left was to dump all the traffic and see what was going wrong with the requests. As I already knew, the ActivityPub protocol, which Mastodon relies on, uses active pushes rather than passive pulls to distribute messages. Thus, it could be something on my side preventing the push from succeeding. I decided to capture all the traffic with tcpdump and inspect it using Wireshark.

Since all the traffic of my Mastodon instance is HTTPS-encrypted behind a reverse proxy, I could only dump the traffic between Nginx and the upstream, then feed all of it into Wireshark to filter by HTTP headers. This was a pain, but I eventually did it and figured something out from the traffic: my instance was replying with 401 Unauthorized to the pushes from cap.moe.

A little inspection into the source code indicated that this error is linked to signature verification. Each ActivityPub message needs to be signed with an Actor's private key, and is then verified using the corresponding public key. I assumed this could only be caused by database errors -- my database must have stored a different public key from the original one, either through an error in a database upgrade or some random cosmic radiation. I checked the public key with

account = Account.find(id_on_cap_moe)
account.public_key

in the Ruby console of Mastodon. I also asked the admin of cap.moe to run the same command with the id on his own instance, and then we compared the output public keys. Unfortunately, they were exactly the same -- this couldn't be the problem either.

The solution

With all the attempts above having failed, I decided I should compare the request of a successful delivery with a failed one. I tooted something on pawoo and then something on cap.moe, while keeping tcpdump running. After this, I fed the dumps to Wireshark as usual and followed the individual HTTP streams. The Signature header drew my attention.

This is the header in the failed request

Signature: keyId="https://cap.moe/users/PeterCxy#main-key",algorithm="rsa-sha256",headers="(request-target) user-agent host date accept-encoding digest content-type",signature="ZC4c0wxPRn+RVYTeAaPjEgA3PDW/jHQ3CdUSn3u+mH2HUxsiQV3TV0dObzC4Z9VGOmY0ZE0cbQ9KiketDxPAq99InDnDjJ49aUT6/L0gSXJQlpM4SGGT8VyipkFm/dzoxbJ8jiT9WjcrXwD1/sJV4IvuA0LJs96mRkuexykguSu2PefvS7PTw5ufAxGTWn3YmtvkMeYLBi5V7LUz3xcONe2iqcSO6hKZ77puTvvWJZgfeNxMyoRXyrcrKUSUZhgfR8z7rwPgxvcoigfiL/SH0xrKyBIdO6HjjjuMsTOSa4xRsrGgopowpAx19ya83YiTRdvkO720u3Dy3ZsWifoRCw=="

And in the successful request from pawoo

Signature: keyId="https://pawoo.net/users/PeterCxy#main-key",algorithm="rsa-sha256",headers="(request-target) user-agent host date digest content-type",signature="Esf8TAlrYId7XhP7AKlRdGTz+tWXT+/ehYCrCLKCgx3UWPxnzNBssawr7oG5xPuB1QU/TLw6M09Rp9pd+0+F20GaEVUE2UTLNwKDizDbEj2XmK7RjEE4ys3Md1b8E+d4YbTVnUWqi0WnufUNTrjLCdyPCPHn3fqJ5Bv9/W4aUDF+nFbJAZr2n1cmu6Nb28nhS1PQAz7AzzsZy/Du+R6S3x91OjRMIa7Xt1EgLWH6/TEchUsxiP78QKZIbzIlEca+BhWCQiQ2qjO+VtwNDDypqh9HheNn23iuy4xm6hKwjHiVVkfekbEK47fNRXH5fakhmHmN7Zl813lrotkIGbDrdA=="

Notice that the headers field in the failed signature indicates that the accept-encoding header is also signed, while it is absent from the successful request.
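This difference matters because of how HTTP Signatures work: the receiver rebuilds a signing string, one "name: value" line per entry in the Signature header's headers list, from the headers as it actually received them, then verifies the RSA signature over that exact string. A toy sketch of the reconstruction (header values are placeholders, not the real ones from the requests above):

```shell
#!/bin/sh
# Toy reconstruction of an HTTP Signatures signing string.
signing_string() {
  # $1 = space-separated header list from the Signature header
  for h in $1; do
    case "$h" in
      "(request-target)") printf '(request-target): post /inbox\n' ;;
      accept-encoding)    printf 'accept-encoding: gzip\n' ;;
      *)                  printf '%s: placeholder\n' "$h" ;;
    esac
  done
}
# cap.moe signed over a list that includes accept-encoding:
signing_string "(request-target) host date accept-encoding digest"
# If the reverse proxy strips Accept-Encoding before Mastodon sees the
# request, the receiver reconstructs a different string, the RSA check
# fails, and Mastodon answers 401 Unauthorized.
```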

Now I knew what was wrong: I had erased the Accept-Encoding header in my Nginx reverse proxy configuration! This was due to the use of sub_filter, which only works on uncompressed responses: I needed to insert something into Mastodon's HTML, and I was too lazy to modify the source code and re-build the Docker image myself.

The solution seems easy now. Originally, my Nginx configuration included

proxy_set_header Accept-Encoding "";

Since I do still want to use sub_filter for HTML pages, I changed it to

set $my_encoding $http_accept_encoding;
if ($http_content_type != "application/activity+json") {
  set $my_encoding "";
}
proxy_set_header Accept-Encoding $my_encoding;

This erases the Accept-Encoding header except when the content type is application/activity+json, which is used to communicate between Mastodon nodes.

Save and reload the Nginx configuration, and everything works fine now.

The cause and more questions

After asking the maintainer of Mastodon, @Gargron@mastodon.social, I figured out where this problem was introduced:

https://github.com/tootsuite/mastodon/pull/7425/commits/4de98db0312de2a45d8f08d6f6611ebc64eed8b1

This pull request added direct support for gzip compression in Mastodon, thus bringing the Accept-Encoding header into the signature. My erasure of this header, obviously, broke the signature check and caused all of this.

However, after all of this, these questions are still not answered:

  1. Why am I only losing federation with some 2.4.0 instances, but not all? The change seemed to be enabled by default, and there should be no way to disable it.
  2. What's the point of including this header in the signature?

I couldn't find the answers on my own, and I decided not to dig further, because nothing is wrong now.

And that's it, the process of troubleshooting a mysterious bug.

"Blocklists"

There just really can't be any idea worse than blocklists.

As a Mastodon instance administrator, I've watched Mastodon grow and gain popularity as a decentralized social medium, especially after the recent Facebook data leakage case. To us, this couldn't be a better phenomenon, since we have always hoped that people would one day wake up from the dream that large entities, such as governments and companies, would ever protect their freedom and / or privacy. However, as the number of Mastodon users and administrators increases, unexpected things also happen, because some users simply followed others onto Mastodon without knowing what they were actually doing. One of these things is the emergence of Mastodon blocklists.

I saw such a blocklist for the first time in a Mastodon post, which was published as an article on Telegraph [1]. To be honest, it was really disturbing at first sight, because I was not expecting this to happen so soon on Mastodon -- I had just been talking with my friend that morning about the possibility of such things happening on Mastodon. Not surprisingly, this blocklist is, just like every other blocklist I've seen, full of personal prejudice and unjustified / unclear criteria. What's more disturbing is that people are actually requesting that Mastodon introduce auto-subscription to such blocklists [2], with unmanned scripts downloading and applying every line of a blocklist published by some unknown, and possibly prejudiced, guy.

To be clear, I am personally totally fine with the idea of domain blocks / account blocks, which have been present in Mastodon for a long time. These are essential tools for some Mastodon instances to stay legal, because instances have different values and different applicable laws. To maintain federation, these differences must be respected. What I am entirely against is brainlessly taking some random guy's blocklist and applying it blindly to your own instance, believing that the list completely corresponds to your own values, and thinking that you have spared yourself the extra work of blocking SPAM / child porn / ... instances and accounts.

Once people get the power of "control", they start building their own version of the place they once escaped from; there is nothing new under the sun.

This was the response from my friend @AstroProfundis on this issue.

Truly, there is nothing new under the sun. It was not long ago that an activist on Twitter was blocked by a popular blocklist that everyone blindly follows [3]. People are fleeing Twitter and Facebook because of their overwhelmingly centralized power, and now they are building their own centralized kingdoms using blocklists, pretending that every instance is still independent even while they all apply the same list of blocked users and domains. Well, unless you would rather call those lists federal laws.

What are we hoping for from a federated social network in the first place? Think about it. To me, it's the ability to scatter users across instances with diverse values and views of the world. It's the possibility that, if a few instances are compromised or act against what their users want, those users can simply move elsewhere and keep the same happy life as before. It's also the opportunity for every minority group to have its voice carried across the entire Fediverse. Sure, instances can each have their own blocking rules, but those will never affect the Fediverse as a whole, and, as I personally believe, there will never be a consensus so wide that most instances block a particular group of people. Our lovely, well-crafted blocklists will completely ruin all of this.

I've run my own e-mail server before -- e-mail being a federated protocol built on an idea similar to Mastodon's -- and what I discovered is that the blocklists essentially prevent you from doing so if you want your e-mail delivered reliably to most hosts. By trusting popular IPs and distrusting unpopular ones, these lists inherently favor the gigantic hosts that own the resources to run fancy machine-learning-based filtering over their outgoing e-mail. (Or even to filter outgoing e-mail by hand? Huh.) Moreover, once blocked, the process of disputing and getting unblocked is overwhelmingly hard for any individual e-mail host to get through. Yes, there are multiple lists following seemingly different standards. Yes, there are ways to get yourself unblocked, provided proper justification is given. Do these make any difference? No. Even North Korea says its people can dispute its jurisdictional decisions -- despite the fact that this would never work.
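For context, the e-mail blocklists mentioned above are typically DNS-based blackhole lists (DNSBLs): a receiving server reverses the octets of the sender's IPv4 address, prepends them to the list's DNS zone, and treats any A-record answer as "listed". A minimal sketch (the zone `dnsbl.example.org` is a placeholder, not a real list):

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNSBL lookup name: reverse the IPv4 octets, append the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the IP
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the DNSBL zone returns any A record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:  # NXDOMAIN or lookup failure -> treated as not listed
        return False


# Query name for 203.0.113.7 against a hypothetical zone:
print(dnsbl_query_name("203.0.113.7", "dnsbl.example.org"))
```

Note how little room this mechanism leaves for nuance: the answer is a bare yes/no per IP, with the reasoning (and any prejudice) hidden inside whoever populates the zone.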

I really hope there will someday be a study of how well these blocklists reflect their criteria as written on paper, without much prejudice. Since there has been none, I can only conclude from personal experience that such blocklists tend to become more prejudiced as they grow. This includes a blockbot that recently appeared in the Chinese-speaking Telegram community, which blocked a bunch of innocent people merely for holding ideas that conflicted with the maintainer's. Our lovely followers of this bot, without knowing anything, then banned those people from every group they controlled.

Blocking is a destructive operation. It should be the last resort after communication has failed, not something to be automated and blindly followed. If the maintainers of blocklists called them Hatelists, I would be completely fine with that, since they would be actively informing people that the lists include personal opinions and are not something to subscribe to without further thought. As long as they are still called Blocklists, I will say a big, big "NO" to them.

Dear Mastodon administrators: unless you share the same values as a blocklist's maintainers now, forever, and for all the foreseeable future, please think twice before following someone else's decision to block a domain or a user. Do not ruin the Fediverse with your own hands.

Because I really don't know what will be the next Mastodon Fediverse to go to.

References

  1. Blockchain Blocklist Advisory
  2. PR #7059: Domain blocking as rake task
  3. When do Twitter block lists start infringing on free speech?