When you develop firmware for Linux routers, “interesting” bugs turn up from time to time: the kind that, three days in, make you want to tear your hair out in the most unexpected places. This is the story of one of them.
Most of our routers have an LTE modem on board, and some have two. This was one of those. New routers appear quite often, so the routine is well rehearsed: first, check that everything powers on; second, adapt the device tree; third, hand the device to the testers to look for anything interesting.
And of course they found something: two modems do not work at the same time. How come? They worked for me! It turns out that it fails when both modems are Quectel. Whatever order they are switched on in, the first one starts normally and the second one comes up in EDL mode. Moreover, since EDL mode starts much faster (under a second versus about 10 seconds), it looks as if the second modem starts first. The problem reproduces with different models from this manufacturer. It is as if the second modem connects to USB, looks around, sees the first one and says: “There is already a Quectel modem on this bus. There are not supposed to be two Quectel modems on the same bus.” It looks like this:
[ 40.828935] usb 3-1: new high-speed USB device number 2 using xhci-hcd
[ 41.084049] qcserial 3-1:1.0: Qualcomm USB modem converter detected
[ 41.084719] usb 3-1: Qualcomm USB modem converter now attached to ttyUSB0
[ 42.388984] usb 1-1: new high-speed USB device number 2 using xhci-hcd
[ 43.462712] option 1-1:1.0: GSM modem (1-port) converter detected
[ 43.463214] usb 1-1: GSM modem (1-port) converter now attached to ttyUSB1
[ 43.464313] option 1-1:1.1: GSM modem (1-port) converter detected
[ 43.464875] usb 1-1: GSM modem (1-port) converter now attached to ttyUSB2
[ 43.466322] option 1-1:1.2: GSM modem (1-port) converter detected
[ 43.466983] usb 1-1: GSM modem (1-port) converter now attached to ttyUSB3
[ 43.468368] option 1-1:1.3: GSM modem (1-port) converter detected
[ 43.469121] usb 1-1: GSM modem (1-port) converter now attached to ttyUSB4
[ 43.590363] qmi_wwan 1-1:1.4: cdc-wdm0: USB WDM device
[ 43.592423] qmi_wwan 1-1:1.4 wwan0: register 'qmi_wwan' at usb-xhci-hcd.2.auto-1, WWAN/QMI device, f6:0c:68:0e:11:4d
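To spot this situation without reading the log by eye, the driver bindings can be grouped per USB device. The following is a minimal sketch, not something from the original investigation: the function names are hypothetical, the sample lines are copied from the log above, and the rule "bound only to qcserial means EDL" is an assumption drawn from this one log.

```python
import re
from collections import defaultdict

# A few lines copied from the kernel log above.
LOG = """\
[ 40.828935] usb 3-1: new high-speed USB device number 2 using xhci-hcd
[ 41.084049] qcserial 3-1:1.0: Qualcomm USB modem converter detected
[ 42.388984] usb 1-1: new high-speed USB device number 2 using xhci-hcd
[ 43.462712] option 1-1:1.0: GSM modem (1-port) converter detected
[ 43.590363] qmi_wwan 1-1:1.4: cdc-wdm0: USB WDM device
"""

# "<driver> <bus>-<port path>:<config>.<interface>:" marks an interface
# that a driver has claimed.
BOUND = re.compile(r"^\[\s*[\d.]+\]\s+(\S+)\s+(\d+-[\d.]+):\d+\.\d+:")


def drivers_by_device(log):
    """Map each USB device path (e.g. '3-1') to the set of bound drivers."""
    result = defaultdict(set)
    for line in log.splitlines():
        match = BOUND.match(line)
        if match:
            driver, device = match.groups()
            result[device].add(driver)
    return dict(result)


def looks_like_edl(drivers):
    """Assumption for this sketch: only qcserial bound means EDL mode."""
    return drivers == {"qcserial"}


if __name__ == "__main__":
    for device, drivers in sorted(drivers_by_device(LOG).items()):
        state = "EDL?" if looks_like_edl(drivers) else "normal"
        print(device, sorted(drivers), state)
```

On the log above this flags device 3-1 as the suspect and 1-1 as the normally started modem.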
EDL is a special mode of Qualcomm modems for sorting out difficult cases. It stands for Emergency DownLoad. It is present in all Android phones with Qualcomm chips, but it usually cannot be entered by normal means.
“…the second modem connects to USB, looks around, sees the first one…” is irony, of course. In fact, the USB protocol gives a device no way to look around and see anything other than the USB host.
Let’s try a different router model. In parallel, a colleague tries the same pair of modems on a PC: if they fail there too, it becomes a question for Quectel. On the other router the problem does not occur, so apparently it is not the modems themselves. Either there is not enough power, or it is the USB or port wiring. On the problematic router the modems are connected directly to root hubs, while on the other one there is an extra hub between the root hub and each modem. While my colleague checks two modems connected directly to the same hub on a PC, I’ll check the power.
The modems plug into mini PCIe connectors. A fine connector in every respect, except that power is on pins 2, 24 and 52, and all the even-numbered contacts are on the bottom side of the connector, hidden under the modem when it is plugged in. We have to probe somewhere else. There must be some resistors or capacitors that this power line passes through, so let’s look at the board layout. In the meantime, my colleague reports that the problem does not appear on the PC. The power also turns out to be fine (3.3 V on both connectors). Apparently the problem is in the software.
Let’s try disabling all processing in user space, although it looks more and more like a hardware problem. Disabling it did not solve the problem. Let’s look at how the power is switched on, and if that doesn’t help, we’ll “dive” into the kernel. By default the modems are switched on one after the other. Let’s add a 10-second delay between them. That doesn’t help either. The kernel log contains messages like “alloc_contig_range [x, y] PFNs busy”. The message seems to indicate trouble allocating a contiguous chunk of memory, which could be the source of the problem. Maybe there is not enough memory for descriptors. And there is much more of this output before the second modem comes up. Maybe the problem is somehow related to the allocated memory.
Sysfs shows that the number of URB buffers differs (in /sys/bus/usb/devices/…).
alloc_contig_range is called (through a chain of calls) from dma_alloc_coherent. Let’s find all the dma_alloc calls in the USB subsystem (drivers/usb) and put a debug print at each one. That leads to usb_alloc_dev in drivers/usb/core/usb.c: all the alloc_contig_range messages come from there. It calls xhci_alloc_dev, which in turn calls xhci_alloc_virt_device. The code inside looks quite innocent, though, so it starts to feel as if the informational messages from alloc_contig_range are a false trail. Nevertheless, this lead has to be followed to the end.
Interesting. Partway through, xhci_ring_alloc is called, and it fires twice in the bad case and 14 times in the good case. But here it is called from somewhere else. That other place turns out to be usb_hcd_alloc_bandwidth, which checks whether there is enough bus bandwidth for the new device. At the same time, it is passed the structures describing the current USB settings and configurations. So it looks as if it checks whether the requested settings can be enabled. The comment “check whether a new bandwidth setting exceeds the bus bandwidth” hints at the same thing. On the other hand, it is strange for a function to allocate memory in order to check something.
It turns out that the function returns 0 for both modems, which is interpreted as success, but a different amount of memory is allocated. We need to check what is in the usb_host_config structure. It holds a “representation of a device’s configuration”, that is, a description of the device’s settings (not the host’s, as one might think). Put simply, all the descriptors read from the device are gathered in this structure.
Each USB device can operate in one of several configurations (most often there is just one). A configuration contains a list of interfaces, so selecting a configuration enables its list of interfaces. Each interface consists of endpoints. Every USB transaction goes to or from some endpoint on the device. All of this is described in descriptors. The first thing the host does when a device connects is read its descriptors.
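The configuration, interface and endpoint hierarchy can be illustrated by walking a raw configuration descriptor the way a host would. This is only a sketch: the field offsets follow the standard USB descriptor layout, but the function name and the example bytes (one vendor-specific interface with two bulk endpoints) are hypothetical and were not captured from the modems in this story.

```python
import struct

# Standard descriptor type codes.
DT_CONFIG, DT_INTERFACE, DT_ENDPOINT = 2, 4, 5
TRANSFER_TYPES = ("control", "isochronous", "bulk", "interrupt")


def parse_configuration(blob):
    """Split a raw configuration descriptor into interfaces and endpoints."""
    config = None
    offset = 0
    while offset + 2 <= len(blob):
        length, dtype = blob[offset], blob[offset + 1]
        if length < 2 or offset + length > len(blob):
            raise ValueError("malformed descriptor at offset %d" % offset)
        body = blob[offset:offset + length]
        if dtype == DT_CONFIG:
            total, n_if, value = struct.unpack_from("<HBB", body, 2)
            config = {"value": value, "total_length": total,
                      "num_interfaces": n_if, "interfaces": []}
        elif dtype == DT_INTERFACE and config is not None:
            number, alt, _n_ep, cls = struct.unpack_from("<BBBB", body, 2)
            config["interfaces"].append(
                {"number": number, "alt": alt, "class": cls, "endpoints": []})
        elif dtype == DT_ENDPOINT and config and config["interfaces"]:
            address, attrs, max_packet = struct.unpack_from("<BBH", body, 2)
            config["interfaces"][-1]["endpoints"].append({
                "number": address & 0x0F,
                "direction": "in" if address & 0x80 else "out",
                "type": TRANSFER_TYPES[attrs & 0x03],
                "max_packet": max_packet & 0x07FF,
            })
        # Unknown descriptor types are skipped by their declared length.
        offset += length
    return config


# Hypothetical example: 1 configuration, 1 interface, 2 bulk endpoints.
EXAMPLE = bytes([
    0x09, 0x02, 0x20, 0x00, 0x01, 0x01, 0x00, 0x80, 0xFA,  # configuration
    0x09, 0x04, 0x00, 0x00, 0x02, 0xFF, 0xFF, 0xFF, 0x00,  # interface 0
    0x07, 0x05, 0x81, 0x02, 0x00, 0x02, 0x00,              # endpoint 1 IN
    0x07, 0x05, 0x01, 0x02, 0x00, 0x02, 0x00,              # endpoint 1 OUT
])

if __name__ == "__main__":
    cfg = parse_configuration(EXAMPLE)
    for interface in cfg["interfaces"]:
        print("interface", interface["number"], interface["endpoints"])
```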
All in all, there seems to be no bug here. Let’s see what the host receives from the device at startup using tcpdump. We add the usbmon driver to the kernel and rebuild libpcap with USB support, then listen to what happens on the bus when the modems are connected. Enabled and built: after insmod usbmon.ko, tcpdump -D shows the coveted usbmon0, usbmon1 and so on. We capture a trace while plugging in the first and the second modem and compare them in Wireshark. The sequence turns out to be quite simple: an interrupt arrives, a port reset is initiated, then the host reads the descriptors. The problematic modem (or root hub) shows itself quite early: the initial port reset has to be done twice, and the very first descriptors already show that this was enough for the device to initialize incorrectly. After the port reset (SET_FEATURE PORT_RESET) the status is read, and the controller driver is apparently unhappy with the status it reads, so it issues another reset. Several repeated tests showed that the double reset is an artifact of the imperfect test procedure: pushing the modem into the slot by hand does not always go smoothly. Otherwise the initialization sequences are identical. So this is a dead end.
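When comparing traces, it helps to recognize the port reset request by its raw bytes. The sketch below decodes the 8-byte setup packet of a control transfer and checks whether it is a hub SET_FEATURE(PORT_RESET) request. The constants are the hub class values from the USB 2.0 specification; the function names and the sample packet (a reset of port 1) are hypothetical, not taken from the actual capture.

```python
import struct

# Hub class values from the USB 2.0 specification (chapter 11).
HUB_PORT_REQUEST = 0x23  # host-to-device, class request, recipient "other"
SET_FEATURE = 0x03
PORT_RESET = 4


def decode_setup(packet):
    """Decode the 8-byte setup stage of a control transfer."""
    if len(packet) != 8:
        raise ValueError("a setup packet is exactly 8 bytes")
    fields = struct.unpack("<BBHHH", packet)
    names = ("bmRequestType", "bRequest", "wValue", "wIndex", "wLength")
    return dict(zip(names, fields))


def port_reset_target(packet):
    """Return the port number if this is SET_FEATURE(PORT_RESET), else None."""
    setup = decode_setup(packet)
    if (setup["bmRequestType"] == HUB_PORT_REQUEST
            and setup["bRequest"] == SET_FEATURE
            and setup["wValue"] == PORT_RESET):
        return setup["wIndex"] & 0xFF  # low byte of wIndex is the port number
    return None


# Hypothetical sample: reset of port 1 on a hub.
SAMPLE = bytes([0x23, 0x03, 0x04, 0x00, 0x01, 0x00, 0x00, 0x00])

if __name__ == "__main__":
    print(decode_setup(SAMPLE))
    print("port reset on port", port_reset_target(SAMPLE))
```

Counting how many such requests precede the first descriptor read in each trace is one way to notice a double reset like the one described above.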
So if the difference is already visible in the very first descriptors, the modem knows at startup that it should not start normally. It can get this knowledge either over the air or over a wire. Radio is probably the wrong option, so let’s try the wires. A quick search shows that to put a Qualcomm chip into EDL mode you need to ground a certain pin. Perhaps when one modem starts, some pin on the second one gets parasitically grounded, or something like that? We plug a modem into the first connector and check all the pins on the second connector, first with the modem switched off, then with it switched on, and voilà! In the second case, pins 3 and 5 show 1.8 V. On the board schematic these pins are labeled coex_1 and coex_2. We look in the modem documentation: reserved. What the …?! That turned out to be an old version. The new one says COEX_UART_RX and COEX_UART_TX, with the note “It is prohibited to be pulled up high before startup”. That is, when the first modem starts, it pulls these pins up (as it should for a UART), and the second one, seeing such indecency, panics and starts in EDL mode.
We solder in the resistors and drink champagne. The two-day epic is over.