331202 |
19-Mar-2018 |
ae |
MFC r330792: Do not try to reassemble IPv6 fragments in "reass" rule.
ip_reass() expects an IPv4 packet and will corrupt any IPv6 packet that it gets. Until a proper IPv6 fragment handling function is implemented, pass IPv6 packets to the next rule.
PR: 170604 |
325731 |
12-Nov-2017 |
truckman |
MFC r325008
Fix Dummynet AQM packet marking function ecn_mark() and fq_codel / fq_pie schedulers packet classification functions in layer2 (bridge mode).
Dummynet AQM packet marking function ecn_mark() and fq_codel/fq_pie schedulers packet classification functions (fq_codel_classify_flow() and fq_pie_classify_flow()) assume mbuf is pointing at L3 (IP) packet. However, this assumption is incorrect if ipfw/dummynet is used to manage layer2 traffic (bridge mode) since mbuf will point at L2 frame. This patch solves this problem by identifying the source of the frame/packet (L2 or L3) and adding ETHER_HDR_LEN offset when converting an mbuf pointer to ip pointer if the traffic is from layer2. More specifically, in dummynet packet tagging function, tag_mbuf(), iphdr_off is set to ETHER_HDR_LEN if the traffic is from layer2 and set to zero otherwise. Whenever an access to IP header is required, mtodo(m, dn_tag_get(m)->iphdr_off) is used instead of mtod(m, struct ip *) to correctly convert mbuf pointer to ip pointer in both L2 and L3 traffic.
Submitted by: lstewart Relnotes: yes Differential Revision: https://reviews.freebsd.org/D12506 |
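Illustration for r325008 (not part of the commit): a minimal userspace sketch of the offset idea described above. The names demo_pkt_tag, demo_tag_packet(), demo_ip_header() and DEMO_ETHER_HDR_LEN are hypothetical stand-ins; the kernel code works on mbufs with mtodo() and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of an untagged Ethernet header (assumed value for this sketch). */
#define DEMO_ETHER_HDR_LEN 14

/* Hypothetical stand-in for the per-packet dummynet tag. */
struct demo_pkt_tag {
    uint16_t iphdr_off; /* bytes from buffer start to the IP header */
};

/* Record the offset once, when the packet is tagged. */
static void
demo_tag_packet(struct demo_pkt_tag *tag, int from_layer2)
{
    tag->iphdr_off = from_layer2 ? DEMO_ETHER_HDR_LEN : 0;
}

/* Return a pointer to the first byte of the IP header. */
static const uint8_t *
demo_ip_header(const uint8_t *buf, size_t len, const struct demo_pkt_tag *tag)
{
    assert(len > tag->iphdr_off);
    return (buf + tag->iphdr_off);
}

/* The IP version is the high nibble of the first header byte. */
static unsigned
demo_ip_version(const uint8_t *buf, size_t len, const struct demo_pkt_tag *tag)
{
    return (demo_ip_header(buf, len, tag)[0] >> 4);
}
```

Reading an L2 frame with a zero offset lands on the destination MAC address instead of the IP header, which is the failure mode the commit describes.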
325230 |
31-Oct-2017 |
ae |
MFC r324947: Add IPv6 support for O_TCPDATALEN opcode.
PR: 222746 |
321873 |
01-Aug-2017 |
philip |
MFC r320941: Fix GRE over IPv6 tunnels with IPFW
Previously, GRE packets in IPv6 tunnels would be dropped by IPFW (unless net.inet6.ip6.fw.deny_unknown_exthdrs was unset).
PR: 220640 Submitted by: Kun Xie <kxie@xiplink.com> |
318905 |
25-May-2017 |
truckman |
MFC r318527
Fix the queue delay estimation in PIE/FQ-PIE when the timestamp (TS) method is used. When packet timestamp is used, the "current_qdelay" keeps storing the last queue delay value calculated in the dequeue function. Therefore, when a burst of packets arrives followed by a pause, the "current_qdelay" will store a high value caused by the burst and stick to that value during the pause because the queue delay measurement is done inside the dequeue function. This causes the drop probability calculation function to calculate high drop probability value instead of zero and prevents the burst allowance mechanism from working properly. Fix this problem by resetting "current_qdelay" inside the drop probability calculation function when the queue length is zero and TS option is used.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> |
318886 |
25-May-2017 |
truckman |
MFC r318511
The result of right shifting a negative signed value is implementation defined. On machines without arithmetic shift instructions, zero bits may be shifted in from the left, giving a large positive result instead of the desired divide-by power-of-2. Fix this by operating on the absolute value and compensating for the possible negation later.
Reverse the order of the underflow/overflow tests and the exponential decay calculation to avoid the possibility of an erroneous overflow detection if p is a sufficiently small non-negative value. Also check for negative values of prob before doing the exponential decay to avoid another instance of right shifting a negative value.
Tested by: Rasool Al-Saadi <ralsaadi@swin.edu.au> |
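Illustration for r318511 (not part of the commit): a minimal sketch of shifting the absolute value and restoring the sign afterwards. The helper name demo_shift_right() is hypothetical and the real PIE code is structured differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Divide v by 2^shift, rounding toward zero, without ever right shifting
 * a negative value (whose result is implementation defined in C).
 */
static int64_t
demo_shift_right(int64_t v, unsigned shift)
{
    assert(shift < 63);
    assert(v != INT64_MIN); /* -v would overflow */

    if (v < 0)
        return (-((-v) >> shift));
    return (v >> shift);
}
```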
318155 |
10-May-2017 |
marius |
MFC: r311817
In dummynet(4), random chunks of memory are cast to struct dn_*, potentially leading to fatal unaligned accesses on architectures with strict alignment requirements. This change fixes dummynet(4) as far as accesses to 64-bit members of struct dn_* are concerned, tripping up on sparc64 with accesses to 32-bit members happening to be correctly aligned there. In other words, this only fixes the tip of the iceberg; larger parts of dummynet(4) still need to be rewritten in order to properly work on all of !x86. In principle, considering the amount of code in dummynet(4) that needs this erroneous pattern corrected, an acceptable workaround would be to declare all struct dn_* packed, forcing compilers to do byte-accesses as a side-effect. However, given that the structs in question aren't laid out well either, this would break ABI/KBI. While at it, replace all existing bcopy(9) calls with memcpy(9) for performance reasons, as there is no need to check for overlap in these cases.
PR: 189219 |
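Illustration for r311817 (not part of the commit): one common way to read a 64-bit member from a possibly unaligned buffer is to copy it into an aligned local. The helpers demo_load_u64() and demo_store_u64() are hypothetical; the log does not show which pattern the commit itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a 64-bit value from a position that may not be 8-byte aligned.
 * Copying into an aligned local avoids the unaligned load that a cast
 * such as *(const uint64_t *)p could generate on strict-alignment CPUs.
 */
static uint64_t
demo_load_u64(const void *p)
{
    uint64_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return (v);
}

/* Write a 64-bit value to a possibly unaligned position. */
static void
demo_store_u64(void *p, uint64_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof(v));
}
```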
317489 |
27-Apr-2017 |
truckman |
MFC r316777 (by cem)
dummynet: Use strlcpy to appease static checkers
Some dummynet modules used strcpy() to copy from a larger buffer (dn_aqm->name) to a smaller buffer (dn_extra_parms->name). It happens that the lengths of the strings in the dn_aqm buffers were always hardcoded to be smaller than the dn_extra_parms buffer ("CODEL", "PIE").
Use strlcpy() instead, to appease static checkers. No functional change.
Reported by: Coverity CIDs: 1356163, 1356165 Sponsored by: Dell EMC Isilon |
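Illustration for r316777 (not part of the commit): strlcpy() is not available in every C library, so this sketch uses a local helper, demo_bounded_copy(), that is assumed to mirror its truncating behaviour. The name and buffer sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst, writing at most dstsize bytes including the
 * terminating NUL. Returns strlen(src), so a return value >= dstsize
 * tells the caller that the copy was truncated.
 */
static size_t
demo_bounded_copy(char *dst, const char *src, size_t dstsize)
{
    size_t srclen;

    assert(dst != NULL && src != NULL);
    srclen = strlen(src);
    if (dstsize != 0) {
        size_t n = srclen < dstsize - 1 ? srclen : dstsize - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return (srclen);
}
```

Unlike strcpy(), the destination size bounds the copy, so a longer source name is truncated instead of overrunning the smaller buffer.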
316325 |
31-Mar-2017 |
truckman |
MFC r315516
Change several constants used by the PIE algorithm from unsigned to signed.
- PIE_MAX_PROB is compared to a variable of type int64_t and the type promotion rules can cause the value of that variable to be treated as unsigned. If the value is actually negative, then the result of the comparison is incorrect, causing the algorithm to perform poorly in some situations. Changing the constant to be signed causes the comparison to work correctly.
- PIE_SCALE is also compared to signed values. Fortunately they are also compared to zero and negative values are discarded so this is more of a cosmetic fix.
- PIE_DQ_THRESHOLD is only compared to unsigned values, but it is small enough that the automatic promotion to unsigned is harmless.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> |
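Illustration for r315516 (not part of the commit): a minimal sketch of the promotion problem. The DEMO_* constants are hypothetical and only assumed to resemble the PIE definitions; the explicit cast in demo_exceeds_unsigned() spells out what the usual arithmetic conversions do implicitly when the constant is unsigned.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PROB_BITS  31
#define DEMO_MAX_PROB_U (1ULL << DEMO_PROB_BITS) /* unsigned constant */
#define DEMO_MAX_PROB_S (1LL << DEMO_PROB_BITS)  /* signed constant */

/* With an unsigned constant, a negative prob is compared as a huge value. */
static int
demo_exceeds_unsigned(int64_t prob)
{
    return ((uint64_t)prob > DEMO_MAX_PROB_U);
}

/* With a signed constant, the comparison keeps the sign of prob. */
static int
demo_exceeds_signed(int64_t prob)
{
    return (prob > DEMO_MAX_PROB_S);
}

/* Clamp a probability to [0, DEMO_MAX_PROB_S] using signed comparisons. */
static int64_t
demo_clamp_prob(int64_t prob)
{
    int64_t r = prob;

    if (r > DEMO_MAX_PROB_S)
        r = DEMO_MAX_PROB_S;
    if (r < 0)
        r = 0;
    assert(r >= 0 && r <= DEMO_MAX_PROB_S);
    return (r);
}
```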
314667 |
04-Mar-2017 |
avg |
MFC r283291: don't use CALLOUT_MPSAFE with callout_init()
The main purpose of this MFC is to reduce conflicts for other merges. Parts of the original change have already "trickled down" via individual MFCs. |
313726 |
14-Feb-2017 |
ngie |
MFC r313356:
Fix typos in comments (returing -> returning) |
302987 |
18-Jul-2016 |
truckman |
MFC r302667
Fix problems in the FQ-PIE AQM cleanup code that could leak memory or cause a crash.
Because dummynet calls pie_cleanup() while holding a mutex, pie_cleanup() is not able to use callout_drain() to make sure that all callouts are finished before it returns, and callout_stop() is not sufficient to make that guarantee. After pie_cleanup() returns, dummynet will free a structure that any remaining callouts will want to access.
Fix these problems by allocating a separate structure to contain the data used by the callouts. In pie_cleanup(), call callout_reset_sbt() to replace the normal callout with a cleanup callout that does the cleanup work for each sub-queue. The instance of the cleanup callout that destroys the last flow will also free the extra allocated block of memory. Protect the reference count manipulation in the cleanup callout with DN_BH_WLOCK() to be consistent with all of the other usage of the reference count where this lock is held by the dummynet code.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> Differential Revision: https://reviews.freebsd.org/D7174 |
302422 |
08-Jul-2016 |
truckman |
MFC r302338
Fix a race condition between the main thread in aqm_pie_cleanup() and the callout thread that can cause a kernel panic. Always do the final cleanup in the callout thread by passing a separate callout function for that task to callout_reset_sbt().
Protect the ref_count decrement in the callout with DN_BH_WLOCK(). All other ref_count manipulation is protected with this lock.
There is still a tiny window between ref_count reaching zero and the end of the callout function where it is unsafe to unload the module. Fixing this would require the use of callout_drain(), but this can't be done because dummynet holds a mutex and callout_drain() might sleep.
Remove the callout_pending(), callout_active(), and callout_deactivate() calls from calculate_drop_prob(). They are not needed because this callout uses callout_init_mtx().
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> Differential Revision: https://reviews.freebsd.org/D6928 |
301772 |
10-Jun-2016 |
truckman |
MFC r300779, r300781, r300783, r300784, r300949, r301162, r301180
r300779 | truckman | 2016-05-26 14:40:13 -0700 (Thu, 26 May 2016) | 64 lines
Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE).
Centre for Advanced Internet Architectures
Implementing AQM in FreeBSD
* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>
* Articles, Papers and Presentations <http://caia.swin.edu.au/freebsd/aqm/papers.html>
* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>
Overview
Recent years have seen a resurgence of interest in better managing the depth of bottleneck queues in routers, switches and other places that get congested. Solutions include transport protocol enhancements at the end-hosts (such as delay-based or hybrid congestion control schemes) and active queue management (AQM) schemes applied within bottleneck queues.
The notion of AQM has been around since at least the late 1990s (e.g. RFC 2309). In recent years the proliferation of oversized buffers in all sorts of network devices (aka bufferbloat) has stimulated keen community interest in four new AQM schemes -- CoDel, FQ-CoDel, PIE and FQ-PIE.
The IETF AQM working group is looking to document these schemes, and independent implementations are a corner-stone of the IETF's process for confirming the clarity of publicly available protocol descriptions. While significant development work on all three schemes has occurred in the Linux kernel, there is very little in FreeBSD.
Project Goals
This project began in late 2015, and aims to design and implement functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ-PIE in FreeBSD (with code BSD-licensed as much as practical). We have chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall and traffic shaper. Implementation of these AQM schemes in FreeBSD will:
* Demonstrate whether the publicly available documentation is sufficient to enable independent, functionally equivalent implementations
* Provide a broader suite of AQM options for sections of the networking community that rely on FreeBSD platforms
Program Members:
* Rasool Al Saadi (developer)
* Grenville Armitage (project lead)
Acknowledgements:
This project has been made possible in part by a gift from the Comcast Innovation Fund.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> X-No objection: core MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D6388
[Remove some code that was added to the mq_append() inline function in HEAD by r258457, which was not merged to stable/10. The AQM patch moved mq_append() from ip_dn_io.c to the new file ip_dn_private.h, so we need to remove that copy of the r258457 changes.]
------------------------------------------------------------------------
r300781 | truckman | 2016-05-26 14:44:52 -0700 (Thu, 26 May 2016) | 7 lines
Modify BOUND_VAR() macro to wrap all of its arguments in () and tweak its expression to work on powerpc and sparc64 (gcc compatibility).
Correct a typo in a nearby comment.
MFC after: 2 weeks (with r300779)
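Illustration for r300781 (not part of the commit): a sketch of why every macro argument is wrapped in parentheses. DEMO_BOUND is a hypothetical clamp macro; the actual BOUND_VAR() definition and its gcc compatibility tweak are not shown in this log.

```c
#include <assert.h>

/*
 * Clamp x to [l, h]. Each argument is parenthesized so that an argument
 * such as "v & 0xff" is evaluated as a unit; without the parentheses,
 * "v & 0xff > h" would parse as "v & (0xff > h)".
 */
#define DEMO_BOUND(x, l, h) \
    ((x) > (h) ? (h) : ((x) > (l) ? (x) : (l)))

/* Clamp the low byte of v to [lo, hi]. */
static int
demo_bound_low_byte(int v, int lo, int hi)
{
    assert(lo <= hi);
    return (DEMO_BOUND(v & 0xff, lo, hi));
}
```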
------------------------------------------------------------------------
r300783 | truckman | 2016-05-26 15:03:28 -0700 (Thu, 26 May 2016) | 4 lines
Correct a typo in a comment.
MFC after: 2 weeks (with r300779)
------------------------------------------------------------------------
r300784 | truckman | 2016-05-26 15:07:09 -0700 (Thu, 26 May 2016) | 5 lines
Include the new AQM files when compiling a kernel with options DUMMYNET.
Reported by: Nikolay Denev <nike_d AT cytexbg DOT com> MFC after: 2 weeks (with r300779)
------------------------------------------------------------------------
r300949 | truckman | 2016-05-29 00:23:56 -0700 (Sun, 29 May 2016) | 10 lines
Cast some expressions that multiply a long long constant by a floating point constant to int64_t. This avoids the runtime conversion of the other operand in a set of comparisons from int64_t to floating point and doing the comparisons in floating point.
Suggested by: lidl Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> MFC after: 2 weeks (with r300779)
------------------------------------------------------------------------
r301162 | truckman | 2016-06-01 13:04:24 -0700 (Wed, 01 Jun 2016) | 9 lines
Replace constant expressions that contain multiplications by fractional floating point values with integer divides. This will eliminate any chance that the compiler will generate code to evaluate the expression using floating point at runtime.
Suggested by: bde Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au> MFC after: 8 days (with r300779 and r300949)
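Illustration for r300949 and r301162 (not part of the commits): a sketch of the two ways to express a fractional threshold. DEMO_MAX_PROB and the one-tenth fraction are hypothetical; the actual PIE constants and fractions are not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_PROB (1LL << 31)

/* r300949 style: a floating point product, cast back to an integer. */
#define DEMO_TENTH_FP ((int64_t)(DEMO_MAX_PROB * 0.1))

/* r301162 style: an integer divide, with no floating point operand. */
#define DEMO_TENTH_INT (DEMO_MAX_PROB / 10)

/* Compare against the integer threshold; both operands are int64_t. */
static int
demo_above_tenth(int64_t prob)
{
    assert(prob >= 0);
    return (prob > DEMO_TENTH_INT);
}
```

The integer form leaves the compiler no floating point expression to evaluate, which is the stated goal of r301162.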
------------------------------------------------------------------------
r301180 | truckman | 2016-06-01 17:42:15 -0700 (Wed, 01 Jun 2016) | 2 lines
Belatedly bump .Dd date for Dummynet AQM import in r300779.
Relnotes: yes |
301231 |
03-Jun-2016 |
truckman |
MFC r266941, r266955
Needed for anticipated dummynet AQM MFC next week.
r266941 | hiren | 2014-06-01 00:28:24 -0700 (Sun, 01 Jun 2014) | 9 lines
ECN marking implementation for dummynet. Changes include both DCTCP and RFC 3168 ECN marking methodology.
DCTCP draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Submitted by: Midori Kato (aoimidori27@gmail.com) Worked with: Lars Eggert (lars@netapp.com) Reviewed by: luigi, hiren
r266955 | hiren | 2014-06-01 13:19:17 -0700 (Sun, 01 Jun 2014) | 5 lines
DNOLD_IS_ECN introduced by r266941 is not required. DNOLD_* flags are for compat with old binaries.
Suggested by: luigi
Discussed with: hiren Relnotes: yes |
297228 |
24-Mar-2016 |
hselasky |
MFC r292254:
Properly drain callouts in the IPFW subsystem to avoid use after free panics when unloading the dummynet and IPFW modules:
- The callout drain function can sleep and should not be called having a non-sleepable lock locked. Remove locks around "ipfw_dyn_uninit(0)".
- Add a new "dn_gone" variable to prevent asynchronous restart of dummynet callouts when unloading the dummynet kernel module.
- Call "dn_reschedule()" locked so that "dn_gone" can be set and checked atomically with regard to starting a new callout.
PR: 208171 Requested by: Franco Fichtner (opnsense.org) Differential Revision: https://reviews.freebsd.org/D3855 |
296649 |
11-Mar-2016 |
ae |
MFC r296348: Use correct size for malloc. |
296311 |
02-Mar-2016 |
ae |
MFC r295969: Fix bug in filling and handling ipfw's O_DSCP opcode. Due to an integer overflow, the CS4 token was handled as BE.
PR: 207459 Approved by: re (gjb) |
291772 |
04-Dec-2015 |
bdrewery |
MFC r291001:
ipfw: Fix dynamic IPv6 rules showing junk for non-specified address masks.
Relnotes: yes |
287963 |
18-Sep-2015 |
melifaro |
MFC r266310
Fix wrong formatting of 0.0.0.0/X table records in ipfw(8).
Add `flags` u16 field to the hole in ipfw_table_xentry structure. The kernel has been guessing the address family of a supplied record from the xent length. Userland, however, gets fixed-size ipfw_table_xentry structures and has been guessing the address family by checking the address with IN6_IS_ADDR_V4COMPAT().
Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records.
PR: bin/189471,kern/200169 |
266678 |
26-May-2014 |
ae |
MFC r266399: Since ipfw nat configures all options in one step, we should set all bits in the mask when calling LibAliasSetMode() to properly clear unneeded options.
PR: 189655 |
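Illustration for r266399 (not part of the commit): a small model of flags/mask update semantics, where only the bits selected by the mask change. demo_set_mode() and demo_configure() are hypothetical helpers that model this behaviour; they are not libalias code.

```c
#include <assert.h>

/* Update only the bits selected by mask; other bits keep their old value. */
static unsigned int
demo_set_mode(unsigned int current, unsigned int flags, unsigned int mask)
{
    return ((flags & mask) | (current & ~mask));
}

/* Configure all options in one step: a full mask clears stale options. */
static unsigned int
demo_configure(unsigned int current, unsigned int wanted)
{
    unsigned int result = demo_set_mode(current, wanted, ~0U);

    assert(result == wanted);
    return (result);
}
```

With a mask that covers only the newly requested bits, options left over from an earlier configuration stay set, which matches the problem described above.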
265700 |
08-May-2014 |
melifaro |
Merge r258708, r258711, r260247, r261117.
r258708: Check ipfw table numbers in both user and kernel space before rule addition. Found by: Saychik Pavel <umka@localka.net>
r258711: Simplify O_NAT opcode handling.
r260247: Use rnh_matchaddr instead of rnh_lookup for longest-prefix match. rnh_lookup is effectively the same as rnh_matchaddr if called with empty network mask.
r261117: Reorder struct ip_fw_chain:
* move rarely-used fields down
* move uh_lock to different cacheline
* remove some unused fields |
265227 |
02-May-2014 |
trociny |
MFC r264963:
Define startup order the same way as it is in dummynet. |
264813 |
23-Apr-2014 |
ae |
MFC r264540: Set oif only for outgoing packets.
PR: 188543 |
264804 |
23-Apr-2014 |
brueffer |
MFC: r264421
Free resources in error cases; re-indent a curly brace while here.
CID: 1199366 Found with: Coverity Prevent(tm) |
263680 |
24-Mar-2014 |
glebius |
Merge r263497: fix ipfw + VIMAGE sysctls.
PR: kern/187665 |
263086 |
12-Mar-2014 |
glebius |
Bulk sync of pf changes from head, in an attempt to fix the build I broke in r263029.
Merge r257186,257215,257349,259736,261797.
These changesets split pfvar.h into several smaller headers and make userland utilities include only some of them. |
262210 |
19-Feb-2014 |
dim |
MFC r261915:
Under sys/netpfil/ipfw, surround two IPv6-specific static functions with #ifdef INET6, since they are unused when INET6 is disabled. |
258912 |
04-Dec-2013 |
rodrigc |
MFC r258588
In sys/netpfil/ipfw/ip_fw_nat.c:vnet_ipfw_nat_uninit() we call "IPFW_WLOCK(chain);". This lock gets deleted in sys/netpfil/ipfw/ip_fw2.c:vnet_ipfw_uninit().
Therefore, vnet_ipfw_nat_uninit() *must* be called before vnet_ipfw_uninit(), but this doesn't always happen, because the VNET_SYSINIT order is the same for both functions. In sys/netpfil/ipfw/ip_fw2.c and sys/netpfil/ipfw/ip_fw_nat.c, IPFW_SI_SUB_FIREWALL == IPFW_NAT_SI_SUB_FIREWALL == SI_SUB_PROTO_IFATTACHDOMAIN and IPFW_MODULE_ORDER == IPFW_NAT_MODULE_ORDER.
Consequently, if VIMAGE is enabled, and jails are created and destroyed, the system sometimes crashes, because we are trying to use a deleted lock.
To reproduce the problem:
(1) Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS, INVARIANTS.
(2) Run this command in a loop:
jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo
(see http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html )
Fix the problem by increasing the value of IPFW_NAT_SI_SUB_FIREWALL, so that vnet_ipfw_nat_uninit() runs before vnet_ipfw_uninit(), as required.
Approved by: re (gjb) |
256281 |
10-Oct-2013 |
gjb |
Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
|
255928 |
28-Sep-2013 |
philip |
Use the correct EtherType for logging IPv6 packets.
Reviewed by: melifaro Approved by: re (kib, glebius) MFC after: 3 days
|
254781 |
24-Aug-2013 |
mav |
Make dummynet use new direct callout(9) execution mechanism. Since the only thing done by the dummynet handler is taskqueue_enqueue() call, it doesn't need extra switch to the clock SWI context.
On an idle system this change halves the number of active CPU cycles and wakes up only one CPU from sleep instead of two.
I was going to make this change much earlier as part of the calloutng project, but waited for a better solution with skipping idle ticks to be implemented. Unfortunately, with the 10.0 release coming, it is better to get at least this in.
|
254776 |
24-Aug-2013 |
trociny |
Make ipfw nat init/uninit work correctly for VIMAGE:
* Do per vnet instance cleanup (previously it was only for vnet0 on module unload, and led to libalias leaks and possible panics due to stale pointer dereferences).
* Instead of protecting ipfw hooks registering/deregistering by only vnet0 lock (which does not prevent pointer access from other vnets), introduce per vnet ipfw_nat_loaded variable. The variable is set after hooks are registered and unset before they are deregistered.
* Devirtualize ifaddr_event_tag as we run only one event handler for all vnets.
* It is supposed that ifaddr_change event handler is called in the interface vnet context, so add an assertion.
Reviewed by: zec MFC after: 2 weeks
|
250246 |
04-May-2013 |
melifaro |
Use unified method for accessing / updating cached rule pointers.
MFC after: 2 weeks
|
250131 |
01-May-2013 |
eadler |
Correct a few sizeof()s
Submitted by: swildner@DragonFlyBSD.org Reviewed by: alfred
|
250039 |
29-Apr-2013 |
glebius |
Remove useless ifdef KLD_MODULE from dummynet module unload path. This fixes panic on unload.
Reported by: pho
|
249925 |
26-Apr-2013 |
glebius |
Add const qualifier to the dst parameter of the ifnet if_output method.
|
248971 |
01-Apr-2013 |
melifaro |
Fix ipfw rule validation partially broken by r248552.
Pointed by: avg MFC with: r248552
|
248697 |
25-Mar-2013 |
ae |
When we are removing a specific set, call ipfw_expire_dyn_rules only once.
Obtained from: Yandex LLC MFC after: 1 week
|
248552 |
20-Mar-2013 |
melifaro |
Add ipfw support for setting/matching DiffServ codepoints (DSCP).
Setting DSCP support is done via O_SETDSCP which works for both IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4. DSCP can be specified by name (AFXY, CSX, BE, EF), by value (0..63) or via tablearg.
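Illustration (not part of the commit): a minimal sketch of the RFC 1624 incremental update, HC' = ~(~HC + ~m + m'), applied to the 16-bit header word that carries the TOS byte. All demo_* names are hypothetical, and header words are handled as plain 16-bit values rather than as an on-wire struct ip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold carries back into the low 16 bits (one's-complement addition). */
static uint16_t
demo_fold(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return ((uint16_t)sum);
}

/* Full checksum over 16-bit words; the checksum word itself must be 0. */
static uint16_t
demo_cksum_full(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;
    size_t i;

    assert(nwords < 65536); /* keeps the 32-bit accumulator in range */
    for (i = 0; i < nwords; i++)
        sum += words[i];
    return ((uint16_t)~demo_fold(sum));
}

/* RFC 1624 eqn. 3: update checksum hc after one word changes m -> m'. */
static uint16_t
demo_cksum_update(uint16_t hc, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum;

    sum = (uint16_t)~hc;
    sum += (uint16_t)~old_word;
    sum += new_word;
    return ((uint16_t)~demo_fold(sum));
}

/* Replace the DSCP (upper six bits of the TOS byte) in header word 0. */
static uint16_t
demo_set_dscp(uint16_t word0, uint8_t dscp)
{
    assert(dscp < 64);
    return ((uint16_t)((word0 & 0xff03) | (dscp << 2)));
}
```

The incremental form touches three 16-bit values instead of re-summing the whole header, and the two low ECN bits of the TOS byte are preserved.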
Matching DSCP is done via another opcode (O_DSCP) which accepts several classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).
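Illustration (not part of the commit): a sketch of storing the 64 possible DSCP values as a bitmask in two 32-bit words. The struct and function names are hypothetical; the real opcode layout is not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

/* Bit n of the 64-entry set is bit (n % 32) of word d[n / 32]. */
struct demo_dscp_set {
    uint32_t d[2];
};

/* Add one DSCP code point (0..63) to the set. */
static void
demo_dscp_add(struct demo_dscp_set *s, unsigned code)
{
    assert(code < 64);
    s->d[code / 32] |= (uint32_t)1 << (code % 32);
}

/* Return 1 if the code point is in the set, 0 otherwise. */
static int
demo_dscp_match(const struct demo_dscp_set *s, unsigned code)
{
    assert(code < 64);
    return ((int)((s->d[code / 32] >> (code % 32)) & 1));
}
```

Code points 32 and above, such as CS4 (32), have to land in the second word; a shift count computed from the raw code point without the modulo would exceed the width of a 32-bit word.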
Many people made their variants of this patch, the ones I'm aware of are (in alphabetic order):
Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin
PR: kern/102471, kern/121122 MFC after: 2 weeks
|
248491 |
19-Mar-2013 |
ae |
Separate the locking macros that are used in the packet flow path from others. This makes it easier to switch to the pfil(4) lock.
|
247626 |
02-Mar-2013 |
melifaro |
Fix callout expiring dynamic rules.
PR: kern/175530 Submitted by: Vladimir Spiridenkov <vs@gtn.ru> MFC after: 2 weeks
|
244634 |
23-Dec-2012 |
melifaro |
Add parentheses to IP_FW_ARG_TABLEARG() definition.
Suggested by: glebius MFC with: r244633
|
244633 |
23-Dec-2012 |
melifaro |
Use unified IP_FW_ARG_TABLEARG() macro for most tablearg checks. Log real value instead of IP_FW_TABLEARG (65535) in ipfw_log().
Noticed by: Vitaliy Tokarenko <rphone@ukr.net> MFC after: 2 weeks
|
243882 |
05-Dec-2012 |
glebius |
Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys.
Exceptions:
- sys/contrib not touched
- sys/mbuf.h edited manually
|
243711 |
30-Nov-2012 |
melifaro |
Use common macros for working with rule/dynamic counters. This is done as preparation to introduce per-cpu ipfw counters.
MFC after: 3 weeks
|
243707 |
30-Nov-2012 |
melifaro |
Make ipfw dynamic states operations SMP-ready.
* Global IPFW_DYN_LOCK() is changed to per-bucket mutex.
* State expiration is done in ipfw_tick every second.
* No expiration is done on forwarding path.
* Hash table resize is done automatically and does not flush all states.
* Dynamic UMA zone is now allocated per each VNET.
* State limiting is now done via UMA(9) API.
Discussed with: ipfw MFC after: 3 weeks Sponsored by: Yandex LLC
|
242834 |
09-Nov-2012 |
melifaro |
Simplify sending keepalives. Prepare ipfw_tick() to be used by other consumers.
Reviewed by: ae(basically) MFC after: 2 weeks
|
242632 |
05-Nov-2012 |
melifaro |
Add assertion to enforce 'nat global' locking requirements changed by r241908.
Suggested by: adrian, glebius MFC after: 3 days
|
242631 |
05-Nov-2012 |
melifaro |
Use unified print_dyn_rule_flags() function for debugging messages instead of hand-made printfs in every place.
MFC after: 1 week
|
242463 |
02-Nov-2012 |
ae |
Remove the recently added sysctl variable net.pfil.forward. Instead, add protocol specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup only when this flag is set.
Suggested by: andre
|
242079 |
25-Oct-2012 |
ae |
Remove the IPFIREWALL_FORWARD kernel option and make it possible to turn on the related functionality at runtime via the sysctl variable net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC Discussed with: net@ MFC after: 2 weeks
|
241913 |
22-Oct-2012 |
glebius |
Switch the entire IPv4 stack to keep the IP packet header in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet.
After this change a packet processed by the stack isn't modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack.
[1] One exception still remains. The raw sockets convert to host byte order before passing a packet to an application. Probably this would remain for ages for compatibility.
[2] The ip_input() still subtracts header len from ip->ip_len, but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
|
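Illustration for r241913 (not part of the commit): a userspace sketch of the convention described above, where the header stays in network byte order and host byte order exists only in a local value. demo_ip_total_len() and DEMO_IP_LEN_OFF are hypothetical names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the 16-bit total length field in an IPv4 header. */
#define DEMO_IP_LEN_OFF 2

/*
 * Return the IPv4 total length in host byte order. The header bytes are
 * only read; nothing in host byte order is written back to the packet.
 */
static uint16_t
demo_ip_total_len(const uint8_t *hdr, size_t hdrlen)
{
    uint16_t len_net;

    assert(hdrlen >= DEMO_IP_LEN_OFF + sizeof(len_net));
    memcpy(&len_net, hdr + DEMO_IP_LEN_OFF, sizeof(len_net));
    return (ntohs(len_net));
}
```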
241908 |
22-Oct-2012 |
melifaro |
Remove unnecessary chain read lock in ipfw nat 'global' code. Document case when ipfw chain lock must be held while calling ipfw_nat().
MFC after: 2 weeks
|
241610 |
16-Oct-2012 |
glebius |
Make the "struct if_clone" opaque to users of the cloning API. Users now use function calls:
if_clone_simple() if_clone_advanced()
to initialize a cloner, instead of macros that initialize if_clone structure.
Discussed with: brooks, bz, 1 year ago
|
241394 |
10-Oct-2012 |
kevlo |
Revert previous commit...
Pointyhat to: kevlo (myself)
|
241370 |
09-Oct-2012 |
kevlo |
Prefer NULL over 0 for pointers
|
241369 |
09-Oct-2012 |
kevlo |
Fix typo: s/unknow/unknown
|
241359 |
08-Oct-2012 |
glebius |
Catch up with r241245 and do not return packet back in host byte order.
|
241344 |
08-Oct-2012 |
glebius |
After r241245 it appeared that in_delayed_cksum(), which still expects host byte order, was sometimes called with net byte order. Since we are moving towards net byte order throughout the stack, the function was converted to expect net byte order, and its consumers fixed appropriately:
- ip_output(), ipfilter(4) not changed, since already call in_delayed_cksum() with header in net byte order.
- divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order there and back.
- mrouting code and IPv6 ipsec now need to switch byte order there and back, but I hope, this is temporary solution.
- In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
- pf_route() catches up on r241245 changes to ip_output().
|
241245 |
06-Oct-2012 |
glebius |
A step in resolving mess with byte ordering for AF_INET. After this change:
- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules: pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.
Reviewed by: ray, luigi, eri, melifaro Tested by: glebius (LE), ray (BE)
|
240494 |
14-Sep-2012 |
glebius |
o Create directory sys/netpfil, where all packet filters should reside, and move there ipfw(4) and pf(4).
o Move most modified parts of pf out of contrib.
Actual movements:
sys/contrib/pf/net/*.c -> sys/netpfil/pf/
sys/contrib/pf/net/*.h -> sys/net/
contrib/pf/pfctl/*.c -> sbin/pfctl
contrib/pf/pfctl/*.h -> sbin/pfctl
contrib/pf/pfctl/pfctl.8 -> sbin/pfctl
contrib/pf/pfctl/*.4 -> share/man/man4
contrib/pf/pfctl/*.5 -> share/man/man5
sys/netinet/ipfw -> sys/netpfil/ipfw
The arguable movement is pf/net/*.h -> sys/net. There are future plans to refactor pf includes, so I decided not to break things twice.
Not modified bits of pf left in contrib: authpf, ftp-proxy, tftp-proxy, pflogd.
The ipfw(4) movement is planned to be merged to stable/9, to make head and stable match.
Discussed with: bz, luigi
|
240233 |
08-Sep-2012 |
glebius |
Merge the projects/pf/head branch, that was worked on for last six months, into head. The most significant achievements in the new code:
o Fine grained locking, thus much better performance.
o Fixes to many problems in pf, that were specific to FreeBSD port.
New code doesn't have that many ifdefs and has far fewer OpenBSDisms, thus is more attractive to our developers.
Those interested in details, can browse through SVN log of the projects/pf/head branch. And for reference, here is exact list of revisions merged:
r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330, r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656, r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782, r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868, r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223, r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456, r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505, r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168, r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230, r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398, r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548, r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672, r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169, r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442, r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522, r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661, r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.
I'd like to thank people who participated in early testing:
Tested by: Florian Smeets <flo freebsd.org> Tested by: Chekaluk Vitaly <artemrts ukr.net> Tested by: Ben Wilber <ben desync.com> Tested by: Ian FREISLICH <ianf cloudseed.co.za>
|
240099 |
04-Sep-2012 |
melifaro |
Introduce new link-layer PFIL hook V_link_pfil_hook. Merge ether_ipfw_chk() and part of bridge_pfil() into unified ipfw_check_frame() function called by PFIL. This change was suggested by rwatson? @ DevSummit.
Remove ipfw headers from ether/bridge code since they are unneeded now.
Note this change introduces some (temporary) performance penalty since PFIL read lock has to be acquired for every link-level packet.
MFC after: 3 weeks
|
239997 |
01-Sep-2012 |
eadler |
Mark the ipfw interface type as not being ether. This fixes an issue where uuidgen tried to obtain an ipfw device's MAC address, which was always zero.
PR: 170460 Submitted by: wxs Reviewed by: bdrewery Reviewed by: delphij Approved by: cperciva MFC after: 1 week
|
239124 |
07-Aug-2012 |
luigi |
s/lenght/length/ in comments
|
239093 |
06-Aug-2012 |
luigi |
move functions outside the SYSBEGIN/SYSEND block
(SYSBEGIN/SYSEND are specific to ipfw/dummynet and are used to emulate sysctl on platforms that do not have them, and they work by creating an array which contains all the sysctl-ed symbols.)
|
239092 |
06-Aug-2012 |
luigi |
use FREE_PKT instead of m_freem to free an mbuf. The former is the standard form used in ipfw/dummynet, so that it is easier to remap it to different memory managers depending on the platform.
|
238988 |
02-Aug-2012 |
luigi |
replace __unused with a portable construct; fix a couple of signed/unsigned warnings.
|
238978 |
01-Aug-2012 |
luigi |
replace inet_ntoa_r with the more standard inet_ntop(). As discussed on -current, inet_ntoa_r() is non standard, has different arguments in userspace and kernel, and almost unused (no clients in userspace, only net/flowtable.c, net/if_llatbl.c, netinet/in_pcb.c, netinet/tcp_subr.c in the kernel)
|
238977 |
01-Aug-2012 |
luigi |
add a cast to avoid a signed/unsigned warning (to be removed when we will have TUNABLE_UINT constructors)
|
238277 |
09-Jul-2012 |
hrs |
Make ipfw0 logging pseudo-interface clonable. It can be created automatically by $firewall_logif rc.conf(5) variable at boot time or manually by ifconfig(8) after a boot.
Discussed on: freebsd-ipfw@
|
238265 |
08-Jul-2012 |
melifaro |
Finally fix lookup (account remaining '\0') and deletion (provide valid key length for radix lookup).
Submitted by: Ihor Kaharlichenko<madkinder at gmail.com> (prev version) Approved by: kib(mentor) MFC after: 3 days
Sponsored by: Shtorm ISP
|
238063 |
03-Jul-2012 |
issyl0 |
- Make ipfw's sched rules case insensitive, for user-friendliness.
- Add a note to the ipfw(8) man page about the rules no longer being case sensitive.
- Fix some typos in the man page.
PR: docs/164772 Reviewed by: bz Approved by: gabor (doc mentor, src committer) MFC after: 2 weeks
|
237479 |
23-Jun-2012 |
melifaro |
Fix interface matching by ipfw table
Submitted by: Ihor Kaharlichenko <madkinder@gmail.com> Tested by: Ihor Kaharlichenko <madkinder@gmail.com> Approved by: kib(mentor) MFC after: 3 days
|
236819 |
09-Jun-2012 |
melifaro |
Validate IPv4 network mask being passed to ipfw kernel interface. Incorrect mask can possibly be one of the reasons for kern/127209 existence.
Approved by: kib(mentor) MFC after: 3 days
|
234946 |
03-May-2012 |
melifaro |
Revert r234834 per luigi@ request.
Cleaner solution (e.g. adding another header) should be done here.
Original log: Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h. Remove ipfw/ip_fw_private.h header from non-ipfw code.
Requested by: luigi Approved by: kib(mentor)
|
234834 |
30-Apr-2012 |
melifaro |
Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h. Remove ipfw/ip_fw_private.h header from non-ipfw code.
Approved by: ae(mentor) MFC after: 2 weeks
|
233745 |
31-Mar-2012 |
glebius |
Don't check malloc(M_WAITOK) results.
|
233478 |
25-Mar-2012 |
melifaro |
- Permit number of ipfw tables to be changed in runtime.
net.inet.ip.fw.tables_max is now read-write.
- Bump IPFW_TABLES_MAX to 65535 Default number of tables is still 128
- Remove IPFW_TABLES_MAX from ipfw(8) code.
Sponsored by Yandex LLC
Approved by: kib(mentor)
MFC after: 2 weeks
|
232868 |
12-Mar-2012 |
melifaro |
Fix VNET build broken by r232865. Temporarily remove the ability to assign different number of tables per VNET instance.
|
232865 |
12-Mar-2012 |
melifaro |
- Add ipfw eXtended tables permitting radix to be used for any kind of keys.
- Add support for IPv6 and interface extended tables
- Make number of tables to be loader tunable in range 0..65534.
- Use IP_FW3 opcode for all new extended table cmds
No ABI changes are introduced. Old userland will see valid tables for IPv4 tables and no entries otherwise. Flush works for any table.
IP_FW3 socket option is used to encapsulate all new opcodes:
/* IP_FW3 header/opcodes */
typedef struct _ip_fw3_opheader {
	uint16_t opcode;	/* Operation opcode */
	uint16_t reserved[3];	/* Align to 64-bit boundary */
} ip_fw3_opheader;
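The header layout above can be checked and exercised in isolation. The following is only a sketch: the struct is a local copy of the definition quoted in this log entry, while fw3_pack() is a hypothetical helper (not part of ipfw) that assumes the opcode-specific data directly follows the header in the option buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the header layout quoted in the commit message. */
typedef struct _ip_fw3_opheader {
	uint16_t opcode;	/* Operation opcode */
	uint16_t reserved[3];	/* Align to 64-bit boundary */
} ip_fw3_opheader;

/* One 16-bit opcode plus three 16-bit pad words: 8 bytes in total. */
static_assert(sizeof(ip_fw3_opheader) == 8, "header must be 8 bytes");

/*
 * Hypothetical helper: write a zero-padded header followed by the
 * payload into buf.  Returns the number of bytes written, or 0 if
 * buf is too small.  The "payload follows header" layout is an
 * assumption made for this illustration.
 */
static size_t
fw3_pack(void *buf, size_t buflen, uint16_t opcode,
    const void *payload, size_t paylen)
{
	ip_fw3_opheader hdr;

	if (buflen < sizeof(hdr) || paylen > buflen - sizeof(hdr))
		return (0);
	memset(&hdr, 0, sizeof(hdr));	/* reserved words stay zero */
	hdr.opcode = opcode;
	memcpy(buf, &hdr, sizeof(hdr));
	if (paylen > 0)
		memcpy((char *)buf + sizeof(hdr), payload, paylen);
	return (sizeof(hdr) + paylen);
}
```

Because the header is exactly 8 bytes, whatever follows it starts on a 64-bit boundary, which is what the reserved words are for.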
New opcodes added: IP_FW_TABLE_XADD, IP_FW_TABLE_XDEL, IP_FW_TABLE_XGETSIZE, IP_FW_TABLE_XLIST
ipfw(8) table argument parsing behavior is changed: 'ipfw table 999 add host' now assumes 'host' to be interface name instead of hostname.
New tunable: net.inet.ip.fw.tables_max controls number of table supported by ipfw in given VNET instance. 128 is still the default value.
New syntax:
ipfw add skipto tablearg ip from any to any via table(42) in
ipfw add skipto tablearg ip from any to any via table(4242) out
This is a bit hackish, special interface name '\1' is used to signal interface table number is passed in p.glob field.
Sponsored by Yandex LLC
Reviewed by: ae Approved by: ae (mentor)
MFC after: 4 weeks
|
232273 |
28-Feb-2012 |
oleg |
- Refresh dynamic tcp rule only if both sides answered keepalive packets.
- Remove some useless assignments.
MFC after: 1 month
|
232272 |
28-Feb-2012 |
oleg |
lookup_dyn_rule_locked(): style(9) cleanup
MFC after: 1 month
|
231991 |
22-Feb-2012 |
ae |
Don't use `m' after m_megapullup.
PR: kern/165373 MFC after: 3 days
|
231852 |
17-Feb-2012 |
bz |
Merge multi-FIB IPv6 support from projects/multi-fibv6/head/:
Extend the so far IPv4-only support for multiple routing tables (FIBs) introduced in r178888 to IPv6 providing feature parity.
This includes an extended rtalloc(9) KPI for IPv6, the necessary adjustments to the network stack, and user land support as in netstat.
Sponsored by: Cisco Systems, Inc. Reviewed by: melifaro (basically) MFC after: 10 days
|
231076 |
06-Feb-2012 |
glebius |
Make the 'tcpwin' option of ipfw(8) accept ranges and lists.
Submitted by: sem
|
230614 |
27-Jan-2012 |
luigi |
a variable was erroneously declared as 32 bit instead of 64.
MFC after: 3 days
|
230452 |
22-Jan-2012 |
bz |
Make #error messages string-literals and remove punctuation.
Reported by: bde (for ip_divert) Reviewed by: bde MFC after: 3 days
|
227458 |
11-Nov-2011 |
eadler |
- add a missing "be" and "in"
- fix other errors introduced when committing r226436
- add 'function' to a sentence where it makes sense
Submitted by: delphij Submitted by: dougb Submitted by: jhb Approved by: dougb Approved by: jhb
|
227309 |
07-Nov-2011 |
ed |
Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
|
227293 |
07-Nov-2011 |
ed |
Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.
This means that their use is restricted to a single C file.
|
227085 |
04-Nov-2011 |
bz |
Always use the opt_*.h options for ipfw.ko, not just when compiled into the kernel. Do not try to build the module in case of no INET support but keep #error calls for now in case we would compile it into the kernel.
This should fix an issue where the module would fail to enable IPv6 support from the rc framework, but also other INET and INET6 parts being silently compiled out without giving a warning in the module case.
While here garbage collect unneeded opt_*.h includes. opt_ipdn.h is not used anywhere but we need to leave the DUMMYNET entry in options for conditional inclusion in kernel so keep the file with the same name.
Reported by: pluknet Reviewed by: plunket, jhb MFC After: 3 days
|
226436 |
16-Oct-2011 |
eadler |
- change "is is" to "is" or "it is" - change "the the" to "the"
Approved by: lstewart Approved by: sahil (mentor) MFC after: 3 days
|
225793 |
27-Sep-2011 |
bz |
Unbreak no-ip and no-inet6 module builds with ipfw. For now continue to build the ip_fw_pfil.c hooks and ipfw even in case of no-ip under the assumption that the private L2 hook (which hopefully eventually will be a pfil hook as well) can still be useful.
Allow building the module without inet as well.
Glanced at by: jhb MFC after: 3 days
|
225518 |
12-Sep-2011 |
jhb |
Allow the ipfw.ko module built with a kernel to honor any IPFIREWALL_* options defined in the kernel config. This more closely matches the behavior of other modules which inherit configuration settings from the kernel configuration during a kernel + modules build.
Reviewed by: luigi Approved by: re (kib) MFC after: 1 week
|
225044 |
20-Aug-2011 |
bz |
Add support for IPv6 to ipfw fwd: Distinguish IPv4 and IPv6 addresses and optional port numbers in user space to set the option for the correct protocol family. Add support in the kernel for carrying the new IPv6 destination address and port. Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change the address in the IP header. Add support for IPv6 forwarding to a non-local destination. Add a regression test utilizing VIMAGE to check all 20 possible combinations I could think of.
Obtained from: David Dolson at Sandvine Incorporated (original version for ipfw fwd IPv6 support) Sponsored by: Sandvine Incorporated PR: bin/117214 MFC after: 4 weeks Approved by: re (kib)
|
225036 |
20-Aug-2011 |
bz |
Hide IPv6 next header parsing warnings under the verbose sysctl so people can possibly disable it when their consoles are flooded, or enable it for debugging.
MFC after: 2 weeks Approved by: re (kib)
|
225034 |
20-Aug-2011 |
bz |
After r225032 fix logging in a similar way masking the IPv6 more fragments flag off so that offset == 0 checks work properly.
PR: kern/145733 Submitted by: Matthew Luckie (mjl luckie.org.nz) MFC after: 2 weeks X-MFC with: r225032 Approved by: re (kib)
|
225033 |
20-Aug-2011 |
bz |
If we detect an IPv6 fragment header and it is not the first fragment, then terminate the loop as we will not find any further headers and for short fragments this could otherwise lead to a pullup error discarding the fragment.
PR: kern/145733 Submitted by: Matthew Luckie (mjl luckie.org.nz) MFC after: 2 weeks Approved by: re (kib)
|
225032 |
20-Aug-2011 |
bz |
ipfw internally checks for offset == 0 to determine whether the packet is a/the first fragment or not. For IPv6 we have added the "more fragments" flag as well to be able to determine whether there will be more as we do not have the fragment header available for logging, while for IPv4 this information can be derived directly from the IPv4 header. This allowed fragmented packets to bypass normal rules as proper masking was not done when checking offset. Split variables to not need masking for IPv6 to avoid further errors.
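To illustrate why a combined offset/flag value breaks the offset == 0 test, here is a minimal sketch. It is not the ipfw code: the constants describe the host-byte-order layout of the IPv6 fragment header's offset/flags word (upper 13 bits are the offset, lowest bit is the "more fragments" flag), and struct frag_info, frag_split() and is_first_fragment() are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Host-order layout of the 16-bit offset/flags word (illustrative). */
#define FRAG_OFF_MASK	0xfff8u	/* upper 13 bits: offset */
#define FRAG_MORE	0x0001u	/* lowest bit: more fragments follow */

/* Offset and flag kept in separate variables, so no masking is needed later. */
struct frag_info {
	uint16_t offset;	/* fragment offset in bytes */
	int more;		/* non-zero if more fragments follow */
};

static struct frag_info
frag_split(uint16_t offlg_host)
{
	struct frag_info fi;

	fi.offset = offlg_host & FRAG_OFF_MASK;
	fi.more = (offlg_host & FRAG_MORE) != 0;
	/* The 13-bit field counts 8-octet units, so bytes are a multiple of 8. */
	assert((fi.offset & 7) == 0);
	return (fi);
}

/* Correct first-fragment test: only the offset part is compared with 0. */
static int
is_first_fragment(uint16_t offlg_host)
{
	return (frag_split(offlg_host).offset == 0);
}
```

A first fragment with the flag set has the raw word 0x0001: comparing the raw word with 0 wrongly reports "not first", while comparing the split-out offset reports it correctly.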
PR: kern/145733 Submitted by: Matthew Luckie (mjl luckie.org.nz) MFC after: 2 weeks Approved by: re (kib)
|
225030 |
20-Aug-2011 |
bz |
While not explicitly allowed by RFC 2460, in case there is no translation technology involved (and that section is suggested to be removed by Errata 2843), single packet fragments do not harm.
There is another errata under discussion to clarify and allow this. Meanwhile add a sysctl to allow disabling this behaviour again. We will treat single packet fragment (a fragment header added when not needed) as if there was no fragment header.
PR: kern/145733 Submitted by: Matthew Luckie (mjl luckie.org.nz) (original version) Tested by: Matthew Luckie (mjl luckie.org.nz) MFC after: 2 weeks Approved by: re (kib)
|
223666 |
29-Jun-2011 |
ae |
Add new rule actions "call" and "return" to ipfw. They make it possible to organize subroutines with rules.
The "call" action saves the current rule number in the internal stack and rules processing continues from the first rule with specified number (similar to skipto action). If later a rule with "return" action is encountered, the processing returns to the first rule with number of "call" rule saved in the stack plus one or higher.
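The control flow described above can be sketched as a small simulation. This is not the kernel implementation: the rule representation, the stack depth and the behaviour of "return" on an empty stack (simply continue with the next rule) or of "call" on a full stack (likewise) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum action { ACT_COUNT, ACT_CALL, ACT_RETURN, ACT_ACCEPT };

struct rule {
	int num;		/* rule number; array sorted ascending */
	enum action act;
	int arg;		/* target rule number for ACT_CALL */
};

#define STACK_MAX 16		/* illustrative stack depth */

/* Index of the first rule whose number is >= num, or n if none. */
static size_t
first_at_or_after(const struct rule *r, size_t n, int num)
{
	size_t i;

	for (i = 0; i < n && r[i].num < num; i++)
		;
	return (i);
}

/*
 * Walk the rule list, recording each visited rule number in trace.
 * Returns the number of rules visited (at most tracelen).
 */
static size_t
run_rules(const struct rule *r, size_t n, int *trace, size_t tracelen)
{
	int stack[STACK_MAX];
	size_t sp = 0, i = 0, t = 0;

	assert(r != NULL && trace != NULL);
	while (i < n && t < tracelen) {
		trace[t++] = r[i].num;
		switch (r[i].act) {
		case ACT_CALL:
			if (sp == STACK_MAX) {	/* assumption: ignore call */
				i++;
				break;
			}
			stack[sp++] = r[i].num;	/* save the calling rule */
			i = first_at_or_after(r, n, r[i].arg);
			break;
		case ACT_RETURN:
			if (sp == 0) {		/* assumption: no-op */
				i++;
				break;
			}
			/* resume at saved rule number plus one, or higher */
			i = first_at_or_after(r, n, stack[--sp] + 1);
			break;
		case ACT_ACCEPT:
			return (t);
		default:
			i++;
			break;
		}
	}
	return (t);
}
```

With rules 100 (call 500), 200, 300 (accept), 500 and 510 (return), the walk visits 100, 500, 510, 200, 300: the subroutine at 500 runs, then processing resumes at the first rule numbered above 100.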
Submitted by: Vadim Goncharov Discussed by: ipfw@, luigi@
|
223637 |
28-Jun-2011 |
bz |
Update packet filter (pf) code to OpenBSD 4.5.
You need to update userland (world and ports) tools to be in sync with the kernel.
Submitted by: mlaier Submitted by: eri
|
223593 |
27-Jun-2011 |
glebius |
Add possibility to pass IPv6 packets to a divert(4) socket.
Submitted by: sem
|
223358 |
21-Jun-2011 |
ae |
Do not use SET_HOST_IPLEN() macro for IPv6 packets.
PR: kern/157239 MFC after: 2 weeks
|
223080 |
14-Jun-2011 |
ae |
Implement "global" mode for ipfw nat. It is similar to natd(8) "globalport" option for multiple NAT instances.
If ipfw rule contains "global" keyword instead of nat_number, then for each outgoing packet ipfw_nat looks up translation state in all configured nat instances. If an entry is found, the packet is aliased according to that entry, otherwise the packet is passed unchanged.
User can specify "skip_global" option in NAT configuration to exclude an instance from the lookup in global mode.
PR: kern/157867 Submitted by: Alexander V. Chernikov (previous version) Tested by: Eugene Grosbein
|
223073 |
14-Jun-2011 |
ae |
Add IPv6 support to the ipfw uid/gid check. Pass an ip_fw_args structure to the check_uidgid() function, since it contains all needed arguments and also pointer to mbuf and now it is possible to use in_pcblookup_mbuf() function.
Since i can not test it for the non-FreeBSD case, i keep this ifdef unchanged.
Tested by: Alexander V. Chernikov MFC after: 3 weeks
|
222806 |
07-Jun-2011 |
ae |
Make the behaviour of the libalias-based in-kernel NAT a bit closer to how natd(8) works. natd(8) drops packets only when libalias returns PKT_ALIAS_IGNORED and "deny_incoming" option is set, but ipfw_nat always did drop packets that were not aliased, even if they should not be aliased and just are going through.
PR: kern/122109, kern/129093, kern/157379 Submitted by: Alexander V. Chernikov (previous version) MFC after: 1 month
|
222748 |
06-Jun-2011 |
rwatson |
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture.
Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x.
Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
222742 |
06-Jun-2011 |
ae |
Do not return EINVAL when user does `ipfw set N flush` on an empty set.
MFC after: 2 weeks
|
222582 |
01-Jun-2011 |
ae |
O_FORWARD_IP is the only action which depends on the result of the lookup of dynamic rules. We are doing forwarding in the following cases:
o For the simple ipfw fwd rule, e.g.
fwd 10.0.0.1 ip from any to any out xmit em0
fwd 127.0.0.1,3128 tcp from any to any 80 in recv em1
o For the dynamic fwd rule, e.g.
fwd 192.168.0.1 tcp from any to 10.0.0.3 3333 setup keep-state
When this rule triggers it creates a dynamic rule, but this dynamic rule should forward packets only in forward direction.
o And the last case that did not work before - simple fwd rule which triggers when some dynamic rule is already executed.
PR: kern/147720, kern/150798 MFC after: 1 month
|
222560 |
01-Jun-2011 |
ae |
Hide some debug messages under debug macro.
MFC after: 1 week
|
222559 |
01-Jun-2011 |
ae |
Hide useless warning under debug macro.
PR: kern/69963 MFC after: 1 week
|
222488 |
30-May-2011 |
rwatson |
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required.
A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary.
Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
222474 |
30-May-2011 |
ae |
Wrap long line.
MFC after: 2 weeks
|
222473 |
30-May-2011 |
ae |
Add tablearg support for ipfw setfib.
PR: kern/156410 MFC after: 2 weeks
|
221521 |
06-May-2011 |
ae |
Convert delay parameter back to ms when reporting to user.
PR: 156838 MFC after: 1 week
|
220914 |
21-Apr-2011 |
glebius |
Use size_t for sopt_valsize.
Submitted by: Brandon Gooch <jamesbrandongooch gmail.com>
|
220878 |
20-Apr-2011 |
bz |
MFp4 CH=191466:
Move fw_one_pass to where it belongs: it is a property of ipfw, not of ip_input.
Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 3 days
|
220837 |
19-Apr-2011 |
glebius |
- Rewrite functions that copyin/out NAT configuration, so that they calculate required memory size dynamically.
- Fix races on chain re-lock.
- Introduce new field to ip_fw_chain - generation count. Now utilized only in the NAT configuration, but can be utilized wider in ipfw.
- Get rid of NAT_BUF_LEN in ip_fw.h
PR: kern/143653
|
220832 |
19-Apr-2011 |
ae |
Add sysctl handlers for net.inet.ip.dummynet.hash_size, .pipe_byte_limit and .pipe_slot_limit oids to prevent setting incorrect values.
MFC after: 2 weeks
|
220831 |
19-Apr-2011 |
ae |
The ipdn_bound_var() function is designed to bound a variable between a specified minimum and maximum. When the specified default value is itself out of bounds, it does not work as expected and does not limit the variable. Check that the default value is in range and limit it if needed. Also bump max_hash_size value to 65536 to correspond with the manual page.
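The fix can be pictured with a small stand-alone sketch. bound_var() and clamp() below are hypothetical stand-ins, not the dummynet function; in particular, replacing any out-of-range value with the (clamped) default is an assumption of this example.

```c
#include <assert.h>

/* Clamp x into the closed interval [lo, hi]. */
static int
clamp(int x, int lo, int hi)
{
	return (x < lo ? lo : (x > hi ? hi : x));
}

/*
 * Hypothetical bounding helper: if *v lies outside [lo, hi], replace
 * it with the default.  The default is clamped first, so that an
 * out-of-range default can never leave *v out of range.
 * Returns the resulting value of *v.
 */
static int
bound_var(int *v, int dflt, int lo, int hi)
{
	assert(lo <= hi);
	dflt = clamp(dflt, lo, hi);	/* the step the fix adds */
	if (*v < lo || *v > hi)
		*v = dflt;
	return (*v);
}
```

Without the clamp on the default, a call such as bound_var(&v, 100000, 16, 65536) with v == 0 would store 100000, which is exactly the unbounded result the commit describes.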
PR: kern/152887 MFC after: 2 weeks
|
220812 |
19-Apr-2011 |
ae |
Use M_WAITOK instead of M_WAIT for malloc. Remove unneeded checks.
MFC after: 1 week
|
220800 |
18-Apr-2011 |
glebius |
LibAliasInit() should allocate memory with M_WAITOK flag. Modify it and its callers.
|
220796 |
18-Apr-2011 |
glebius |
Pullup up to TCP header length before matching against 'tcpopts'.
PR: kern/156180 Reviewed by: luigi
|
220568 |
12-Apr-2011 |
ae |
Restore previous behaviour - always match the rule when we are doing tagging, even when the tag already exists.
Reported by: Vadim Goncharov MFC after: 1 week
|
220211 |
31-Mar-2011 |
ae |
Fill up src_port and dst_port variables for SCTP over IPv4.
PR: kern/153415 MFC after: 1 week
|
220204 |
31-Mar-2011 |
ae |
Fix malloc types.
MFC after: 1 week
|
220203 |
31-Mar-2011 |
ae |
Fix a memory leak. Memory that is allocated for schedulers hash table was not freed.
PR: kern/156083 MFC after: 1 week
|
218909 |
21-Feb-2011 |
brucec |
Fix typos - remove duplicate "the".
PR: bin/154928 Submitted by: Eitan Adler <lists at eitanadler.com> MFC after: 3 days
|
218741 |
16-Feb-2011 |
pluknet |
Bump dummynet module version to meet dummynet schedulers' requirements, and thus unbreak loading dummynet.ko via /boot/loader.conf.
Reported by: rihad <rihad att mail.ru> on freebsd-net Approved by: kib (mentor)
|
218360 |
05-Feb-2011 |
luigi |
correct the 'output_time' of packets generated by dummynet. In the dec.2009 rewrite I introduced a bug, using for the computation the arrival time instead of the time the packet has exited from the queue. The bandwidth computation was still correct because it is computed elsewhere, but traffic was sent out in bursts.
The bug is also present in RELENG_8 after dec.2009
Thanks to Daikichi Osuga for investigating, finding and fixing the bug with detailed graphs of the behaviour before and after the fix.
Submitted by: Daikichi Osuga MFC after: 2 weeks
|
217361 |
13-Jan-2011 |
jhb |
Use a blocking malloc() to initialize the dummynet taskq.
Reviewed by: luigi
|
217322 |
12-Jan-2011 |
mdf |
sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the net* piece.
|
217110 |
07-Jan-2011 |
jhb |
Use a regular taskqueue for dummynet rather than a "fast" taskqueue.
Reviewed by: luigi
|
215701 |
22-Nov-2010 |
dim |
After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------ r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined.
------------------------------------------------------------------------ r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
------------------------------------------------------------------------ r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
|
215317 |
14-Nov-2010 |
dim |
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
|
215179 |
12-Nov-2010 |
luigi |
The first customer of the SO_USER_COOKIE option: the "sockarg" ipfw option matches packets associated to a local socket and with a non-zero so_user_cookie value. The value is made available as tablearg, so it can be used as a skipto target or pipe number in ipfw/dummynet rules.
Code by Paul Joe, manpage by me.
Submitted by: Paul Joe MFC after: 1 week
|
213329 |
01-Oct-2010 |
luigi |
put back the assignment to sched_time. It was correct, and it was necessary.
Submitted by: Riccardo Panicucci
|
213279 |
29-Sep-2010 |
luigi |
remove an unnecessary (and wrong) assignment. It was meant to reset idle_time (and it was not needed), but i even used the wrong field.
Obtained from: Oleg MFC after: 3 days
|
213267 |
29-Sep-2010 |
luigi |
whitespace changes in preparation for future commits
|
213265 |
29-Sep-2010 |
luigi |
fix handling of initial credit for an idle pipe. This fixes the bug where setting bw > 1 MTU/tick resulted in infinite bandwidth if io_fast=1
PR: 147245 148429 Obtained from: Riccardo Panicucci MFC after: 3 days
|
213254 |
28-Sep-2010 |
luigi |
fix breakage in in-kernel NAT: the code did not honor net.inet.ip.fw.one_pass and always moved to the next rule in case of a successful nat.
This should fix several related PR (waiting for feedback before closing them)
PR: 145167 149572 150141 MFC after: 3 days
|
213253 |
28-Sep-2010 |
luigi |
Whitespace changes to reduce diffs wrt the most recent ipfw/dummynet code: + remove an unused macro, + adjust the constants in an enum + small whitespace changes
MFC after: 3 days
|
212256 |
06-Sep-2010 |
glebius |
in_delayed_cksum() requires host byte order.
Reported by: Alexander Levin <amindomao googlemail.com> MFC after: 1 week
|
211992 |
30-Aug-2010 |
maxim |
o Some programs could send broadcast/multicast traffic to ipfw pseudo-interface. This leads to a panic due to uninitialized if_broadcastaddr address. Initialize it and implement ip_output() method to prevent mbuf leak later.
ipfw pseudo-interface should never send anything therefore call panic(9) in if_start() method.
PR: kern/149807 Submitted by: Dmitrij Tejblum MFC after: 2 weeks
|
210537 |
27-Jul-2010 |
glebius |
Fix operation of "netgraph" action in conjunction with the net.inet.ip.fw.one_pass sysctl.
The "ngtee" action is still broken.
PR: kern/148885 Submitted by: Nickolay Dudorov <nnd mail.nsk.ru>
|
210123 |
15-Jul-2010 |
luigi |
remove some conditional #ifdefs (no-op on FreeBSD); run the timer routine on cpu 0.
|
210120 |
15-Jul-2010 |
luigi |
whitespace fixes
|
210119 |
15-Jul-2010 |
luigi |
fix a comment and final empty line
|
209845 |
09-Jul-2010 |
glebius |
Improve last commit: use bpf_mtap2() to avoid stack usage.
Prodded by: julian
|
209797 |
08-Jul-2010 |
glebius |
Since r209216 bpf(4) searches for mbuf_tags(9) and thus will not work with a stub m_hdr instead of a full mbuf.
PR: kern/148050
|
209589 |
29-Jun-2010 |
glebius |
After processing the O_SKIPTO opcode our cmd points to the next rule, and "match" processing at the end of inner loop would look ahead into the next rule, which is incorrect. Particularly, in the case when the next rule started with F_NOT opcode it was skipped blindly.
To fix this, exit the inner loop with the continue operator forcibly and explicitly.
PR: kern/147798
|
206845 |
19-Apr-2010 |
luigi |
whitespace fixes (trailing whitespace, bad indentation after a merge, etc.)
|
206461 |
10-Apr-2010 |
bz |
Try to help with a virtualized dummynet after r206428.
This adds the explicit include (so far probably included through one of the few "hidden" includes in other header files) for vnet.h and adds a cast to unbreak LINT-VIMAGE.
|
206428 |
09-Apr-2010 |
luigi |
This commit enables partial operation of dummynet with kernels compiled with "options VIMAGE". As it is now, there is still a single instance of the pipes, and it is only usable from vnet0 (the main instance). Trying to use a pipe from a different vimage does not crash the system as it did before, but the traffic coming out from the pipe goes to the wrong place, and i still need to figure out where.
Support for per-vimage pipes is almost there (just a matter of uncommenting the VNET_* definitions for dn_cfg, plus putting into the structure the remaining static variables), however i need first to figure out how init/uninit work, and also to understand where packets are ending up on exit from a pipe.
In summary: vimage support for dummynet is not complete yet, but we are getting there.
|
206425 |
09-Apr-2010 |
luigi |
no need to pass an argument to dn_compat_calc_size()
MFC after: 3 days
|
206339 |
07-Apr-2010 |
luigi |
Hopefully fix the recent breakage in rule deletion. A few more tests and this will also go into -stable where the problem is more critical.
|
205955 |
31-Mar-2010 |
luigi |
fix bug in previous commit related to rule deletion (stable/8 just fixed moments ago)
|
205831 |
29-Mar-2010 |
luigi |
remove a leftover debugging message
|
205830 |
29-Mar-2010 |
luigi |
Fix handling of set manipulations. This patch has two fixes for potential kernel panics (one wrong index, one access to the wrong lock) and two fixes to wrong logic in a conditional. The potential panics are also on stable/8, so I am going to MFC the fix quickly.
|
205602 |
24-Mar-2010 |
luigi |
Honor ip.fw.one_pass when a packet comes out of a pipe without being delayed. I forgot to handle this case when i did the mtag cleanup three months ago.
PR: 145004
|
205417 |
21-Mar-2010 |
luigi |
Add a priority-based packet scheduler.
Sponsored by: The ONELAB2 Project Submitted by: Riccardo Panicucci
|
205415 |
21-Mar-2010 |
luigi |
no need for ipfw_flush_tables(), we just need ipfw_destroy_tables()
|
205414 |
21-Mar-2010 |
luigi |
revise documentation
|
205178 |
15-Mar-2010 |
luigi |
small fixes to estimate the buffer size when requesting all pipes/flows.
|
205173 |
15-Mar-2010 |
luigi |
+ implement (two lines) the kernel side of 'lookup dscp N' to use the dscp as a search key in table lookups;
+ (re)implement a sysctl variable to control the expire frequency of pipes and queues when they become empty;
+ add 'queue number' as optional part of the flow_id. This can be enabled with the command
queue X config mask queue ...
and makes it possible to support priority-based schedulers, where packets should be grouped according to the priority and not some fields in the 5-tuple. This is implemented as follows:
- redefine a field in the ipfw_flow_id (in sys/netinet/ip_fw.h) but without changing the size or shape of the structure, so there are no ABI changes. In passing, also document how other fields are used, and remove some useless assignments in ip_fw2.c
- implement small changes in the userland code to set/read the field;
- revise the functions in ip_dummynet.c to manipulate masks so they also handle the additional field;
There are no ABI changes in this commit.
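As an illustration of using the DSCP as a search key: the DSCP is the upper six bits of the IPv4 TOS byte, so the key can be derived with a shift. The sketch below uses a plain 64-entry array as a stand-in for the real ipfw table; `dscp_from_tos` and `dscp_lookup` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define DSCP_VALUES 64	/* DSCP is a 6-bit field */

/* The two low bits of the TOS byte are ECN and are not part of the key. */
static uint32_t
dscp_from_tos(uint8_t tos)
{
	return ((uint32_t)tos >> 2);
}

/* Hypothetical table: index is the DSCP, value plays the role of tablearg. */
static uint32_t
dscp_lookup(const uint32_t table[DSCP_VALUES], uint8_t tos)
{
	uint32_t key = dscp_from_tos(tos);

	assert(key < DSCP_VALUES);
	return (table[key]);
}
```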
|
205050 |
11-Mar-2010 |
luigi |
implement listing of a subset of pipes/queues/schedulers. The filtering of the output is done in the kernel instead of userland to reduce the amount of data transferred.
|
204954 |
10-Mar-2010 |
luigi |
fix handling of commands issued by the RELENG_7 version of /sbin/ipfw.
Submitted by: Riccardo Panicucci
|
204866 |
08-Mar-2010 |
luigi |
cosmetic changes and C++ compatibility
|
204865 |
08-Mar-2010 |
luigi |
don't use C++ keywords as variable names
|
204862 |
08-Mar-2010 |
luigi |
do not report an error unnecessarily
|
204837 |
07-Mar-2010 |
bz |
Not only flush the ipfw tables when unloading ipfw or tearing down a virtual network stack, but also free the Radix Node Head.
Sponsored by: ISPsystem Reviewed by: julian MFC after: 5 days
|
204763 |
05-Mar-2010 |
luigi |
plug a memory leak on pipe's reconfiguration
|
204754 |
05-Mar-2010 |
luigi |
fix a memory leak when deleting RED queues
|
204736 |
04-Mar-2010 |
luigi |
portability fixes
|
204735 |
04-Mar-2010 |
luigi |
don't use keywords as variable names.
|
204714 |
04-Mar-2010 |
luigi |
use callout_drain() (outside the lock) when unloading the module. This prevents a potential deadlock.
Submitted by: Francesco Magno
|
204713 |
04-Mar-2010 |
luigi |
improve compatibility with RELENG_7.2
|
204591 |
02-Mar-2010 |
luigi |
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet.
The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions.
In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ.
Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland.
Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time.
Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits.
CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
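For readers unfamiliar with Deficit Round Robin, here is a textbook sketch of one service round for a single flow. It is not the committed scheduler: the structure layout, `MAXPKT`, and the function name `drr_serve` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAXPKT 8	/* hypothetical per-flow queue capacity */

struct drr_flow {
	int    quantum;		/* credit added per round, in bytes */
	int    deficit;		/* unused credit carried to the next round */
	int    pkt[MAXPKT];	/* queued packet lengths, FIFO order */
	size_t head, tail;	/* pkt[head..tail) are queued */
};

/*
 * Serve one round: add the quantum, then send packets while the head
 * packet fits in the accumulated deficit.  Returns the bytes sent.
 */
static int
drr_serve(struct drr_flow *f)
{
	int sent = 0;

	assert(f->quantum > 0);
	if (f->head == f->tail)
		return (0);		/* idle flows earn no credit */
	f->deficit += f->quantum;
	while (f->head < f->tail && f->pkt[f->head] <= f->deficit) {
		f->deficit -= f->pkt[f->head];
		sent += f->pkt[f->head];
		f->head++;
	}
	if (f->head == f->tail)
		f->deficit = 0;		/* an emptied queue forfeits leftover credit */
	return (sent);
}
```

With a 500-byte quantum and packets of 300, 400 and 200 bytes, the first round sends only the 300-byte packet and carries 200 bytes of credit forward, which lets the next round send the remaining two.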
|
204003 |
17-Feb-2010 |
luigi |
remove recursive lock/unlock calls, we do them already before entering the switch.
Reported by: Marta Carbone
|
202459 |
17-Jan-2010 |
ume |
Change 'me' to match any IPv6 address configured on an interface in the system as well as any IPv4 address.
Reviewed by: David Horn <dhorn2000__at__gmail.com>, luigi, qingli MFC after: 2 weeks
|
201745 |
07-Jan-2010 |
luigi |
we don't use dummynet_drain!
|
201740 |
07-Jan-2010 |
luigi |
check that we have an ipv4 packet before swapping ip_len and ip_off. This should fix the handling of ipv6 packets which i broke when i made ipfw operate on packets in network format.
Reported by: Hajimu UMEMOTO
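A sketch of the kind of guard described above, written against a raw byte buffer rather than an mbuf. The helper `ipv4_total_len` is hypothetical; the point is only that the version nibble is checked before any IPv4-only field is converted.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the IPv4 total length in host order, or -1 if the buffer does
 * not start with an IPv4 header.  The version is the high nibble of the
 * first byte for both IPv4 and IPv6.
 */
static int
ipv4_total_len(const uint8_t *pkt, size_t len)
{
	uint16_t ip_len;

	assert(pkt != NULL);
	if (len < 20 || (pkt[0] >> 4) != 4)
		return (-1);
	memcpy(&ip_len, pkt + 2, sizeof(ip_len));	/* bytes 2-3: total length */
	return (ntohs(ip_len));
}
```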
|
201735 |
07-Jan-2010 |
luigi |
Following up on a request from Ermal Luci to make ip_divert work as a client of pf(4), make ip_divert not depend on ipfw.
This is achieved by moving to ip_var.h the struct ipfw_rule_ref (which is part of the mtag for all reinjected packets) and other declarations of global variables, and moving to raw_ip.c global variables for filter and divert hooks.
Note that names and locations could be made more generic (ipfw_rule_ref is really a generic reference robust to reconfigurations; the packet filter is not necessarily ipfw; filters and their clients are not necessarily limited to ipv4), but _right now_ most of this stuff works on ipfw and ipv4, so i don't feel like doing a gratuitous renaming, at least for the time being.
|
201732 |
07-Jan-2010 |
luigi |
some header shuffling to help decoupling ip_divert from ipfw
|
201722 |
07-Jan-2010 |
luigi |
put ip_len in correct order for ip_output(). This prevents a panic when ipfw generates packets on its own (such as reject or keepalives for dynamic rules).
Reported by: Chagin Dmitry
|
201568 |
05-Jan-2010 |
luigi |
this file does not require ip_dummynet.h
|
201527 |
04-Jan-2010 |
luigi |
Various cleanup done in ipfw3-head branch including:
- use a uniform mtag format for all packets that exit and re-enter the firewall in the middle of a rulechain. On reentry, all tags containing reinject info are renamed to MTAG_IPFW_RULE so the processing is simpler.
- make ipfw and dummynet use ip_len and ip_off in network format everywhere. Conversion is done only once instead of tracking the format in every place.
- use a macro FREE_PKT to dispose of mbufs. This eases portability.
In passing I also removed a few typos, made variables static or local where possible, and removed useless declarations and other minor things.
Overall the code shrinks a bit and is hopefully more readable.
I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr. For ng_ipfw i am actually waiting for feedback from glebius@ because we might have some small changes to make. For if_bridge and if_ethersubr feedback would be welcome (there are still some redundant parts in these two modules that I would like to remove, but first i need to check functionality).
|
201150 |
29-Dec-2009 |
luigi |
we really need htonl() here, see the comment a few lines above in the code.
|
201124 |
28-Dec-2009 |
luigi |
bring the NGM_IPFW_COOKIE back into ng_ipfw.h, libnetgraph expects to find it there. Unfortunately this reintroduces the dependency on ip_fw_pfil.c
|
201122 |
28-Dec-2009 |
luigi |
bring in several cleanups tested in ipfw3-head branch, namely:
r201011 - move most of ng_ipfw.h into ip_fw_private.h, as this code is ipfw-specific. This removes a dependency on ng_ipfw.h from some files.
- move many equivalent definitions of direction (IN, OUT) for reinjected packets into ip_fw_private.h
- document the structure of the packet tags used for dummynet and netgraph;
r201049 - merge some common code to attach/detach hooks into a single function.
r201055 - remove some duplicated code in ip_fw_pfil. The input and output processing uses almost exactly the same code so there is no need to use two separate hooks. ip_fw_pfil.o goes from 2096 to 1382 bytes of .text
r201057 (see the svn log for full details) - macros to make the conversion of ip_len and ip_off between host and network format more explicit
r201113 (the remaining parts) - readability fixes -- put braces around some large for() blocks, localize variables so the compiler does not think they are uninitialized, do not insist on precise allocation size if we have more than we need.
r201119 - when doing a lookup, keys must be in big endian format because this is what the radix code expects (this fixes a bug in the recently-introduced 'lookup' option)
No ABI changes in this commit.
MFC after: 1 week
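To illustrate why lookup keys need to be in big-endian format: a prefix match walks the key from its most significant bit, so the most significant byte has to come first in memory. The sketch below is a simplified stand-in for the radix code; `make_key` and `prefix_match` are hypothetical helpers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a 4-byte key from a host-order value; htonl() puts the MSB first. */
static void
make_key(uint32_t host_value, uint8_t key[4])
{
	uint32_t be = htonl(host_value);

	memcpy(key, &be, sizeof(be));
}

/* Does key fall inside prefix/plen?  Both are big-endian byte arrays. */
static int
prefix_match(const uint8_t key[4], const uint8_t prefix[4], int plen)
{
	int full = plen / 8, rem = plen % 8;

	assert(plen >= 0 && plen <= 32);
	if (memcmp(key, prefix, (size_t)full) != 0)
		return (0);
	if (rem == 0)
		return (1);
	/* compare only the top 'rem' bits of the first partial byte */
	return (((key[full] ^ prefix[full]) >> (8 - rem)) == 0);
}
```

Without the htonl() call, a little-endian host would store the least significant byte first and the same comparison would match on the wrong bits.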
|
201121 |
28-Dec-2009 |
luigi |
readability fixes -- add braces on large blocks, remove unnecessary initializations
|
201120 |
28-Dec-2009 |
luigi |
explain details of operation of table lookups, and improve portability
|
201046 |
27-Dec-2009 |
luigi |
diverted packet must re-enter _after_ the matching rule, or we create loops. The divert cookie (that can be set from userland too) contains the matching rule nr, so we must start from nr+1.
Reported by: Joe Marcus Clarke
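The nr+1 restart can be sketched over a sorted array of rule numbers. The helper names are hypothetical and the linear scan is only for brevity; it does not reflect how the kernel walked the rule chain at the time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given rule numbers sorted in ascending order, return the index of the
 * first rule whose number is >= want, or n if there is none.
 */
static size_t
first_rule_at_or_after(const uint16_t *rulenum, size_t n, uint32_t want)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (rulenum[i] >= want)
			break;
	return (i);
}

/* A diverted packet resumes strictly after the rule that diverted it. */
static size_t
divert_reentry_slot(const uint16_t *rulenum, size_t n, uint16_t cookie)
{
	assert(rulenum != NULL || n == 0);
	/* widen before adding 1 so that cookie 65535 does not wrap to 0 */
	return (first_rule_at_or_after(rulenum, n, (uint32_t)cookie + 1));
}
```

Restarting at nr instead of nr+1 would land on the divert rule itself, which is the loop the commit message describes.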
|
200951 |
24-Dec-2009 |
luigi |
fix poor indentation resulting from a merge
|
200909 |
23-Dec-2009 |
luigi |
mostly style changes, such as removal of trailing whitespace, reformatting to avoid unnecessary line breaks, small block restructuring to avoid unnecessary nesting, replace macros with function calls, etc.
As a side effect of code restructuring, this commit fixes one bug: previously, if a realloc() failed, memory was leaked. Now, the realloc is not there anymore, as we first count how much memory we need and then do a single malloc.
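The count-then-allocate pattern mentioned above can be sketched as follows. The example joins strings, which is not what the committed code does; `join_words` is a hypothetical function chosen only to show the two passes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Join n strings with single spaces.  Pass 1 sizes the result, pass 2
 * fills one buffer, so there is no realloc() whose failure path could
 * leak the previous buffer.  Returns NULL on allocation failure.
 */
static char *
join_words(const char *const *words, size_t n)
{
	size_t i, need = 1;		/* 1 for the trailing NUL */
	char *buf, *p;

	for (i = 0; i < n; i++)
		need += strlen(words[i]) + (i > 0 ? 1 : 0);
	buf = malloc(need);
	if (buf == NULL)
		return (NULL);
	p = buf;
	for (i = 0; i < n; i++) {
		size_t len = strlen(words[i]);

		if (i > 0)
			*p++ = ' ';
		memcpy(p, words[i], len);
		p += len;
	}
	*p = '\0';
	assert((size_t)(p - buf) + 1 == need);
	return (buf);
}
```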
|
200897 |
23-Dec-2009 |
luigi |
fix build with the new fast lookup structure. Also remove some unnecessary headers
|
200896 |
23-Dec-2009 |
luigi |
fix build on 64-bit architectures. Also fix the indentation on a few lines.
|
200855 |
22-Dec-2009 |
luigi |
merge code from ipfw3-head to reduce contention on the ipfw lock and remove all O(N) sequences from kernel critical sections in ipfw.
In detail:
1. introduce an IPFW_UH_LOCK to arbitrate requests from the upper half of the kernel. Some things, such as 'ipfw show', can be done holding this lock in read mode, whereas insert and delete require IPFW_UH_WLOCK.
2. introduce a mapping structure to keep rules together. This replaces the 'next' chain currently used in ipfw rules. At the moment the map is a simple array (sorted by rule number and then rule_id), so we can find a rule quickly instead of having to scan the list. This reduces many expensive lookups from O(N) to O(log N).
3. when an expensive operation (such as insert or delete) is done by userland, we grab IPFW_UH_WLOCK, create a new copy of the map without blocking the bottom half of the kernel, then acquire IPFW_WLOCK and quickly update pointers to the map and related info. After dropping IPFW_LOCK we can then continue the cleanup protected by IPFW_UH_LOCK. So userland still costs O(N) but the kernel side is only blocked for O(1).
4. do not pass pointers to rules through dummynet, netgraph, divert etc, but rather pass a <slot, chain_id, rulenum, rule_id> tuple. We validate the slot index (in the array of #2) with chain_id, and if successful do a O(1) dereference; otherwise, we can find the rule in O(log N) through <rulenum, rule_id>
All the above does not change the userland/kernel ABI, though there are some disgusting casts between pointers and uint32_t
Operation costs now are as follows:
Function                                  Old     Now       Planned
-------------------------------------------------------------------
+ skipto X, non cached                    O(N)    O(log N)
+ skipto X, cached                        O(1)    O(1)
XXX dynamic rule lookup                   O(1)    O(log N)  O(1)
+ skipto tablearg                         O(N)    O(1)
+ reinject, non cached                    O(N)    O(log N)
+ reinject, cached                        O(1)    O(1)
+ kernel blocked during setsockopt()      O(N)    O(1)
-------------------------------------------------------------------
The only (very small) regression is on dynamic rule lookup and this will be fixed in a day or two, without changing the userland/kernel ABI
Supported by: Valeria Paoli MFC after: 1 month
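Points 2 and 4 above can be sketched together: a binary search over the sorted map, and a reference that is dereferenced in O(1) when its chain_id still matches and re-resolved in O(log N) otherwise. The structures are reduced to the fields the message mentions, and all identifiers are hypothetical rather than the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rule {			/* reduced to the sort keys */
	uint16_t rulenum;
	uint32_t id;
};

struct chain {
	struct rule **map;	/* sorted by rulenum, then id */
	int           n_rules;
	uint32_t      id;	/* changes whenever the map is replaced */
};

struct rule_ref {		/* <slot, chain_id, rulenum, rule_id> */
	int      slot;
	uint32_t chain_id;
	uint16_t rulenum;
	uint32_t rule_id;
};

/* Binary search: index of the first rule with (rulenum, id) >= (key, id). */
static int
find_rule(const struct chain *c, uint16_t key, uint32_t id)
{
	int lo = 0, hi = c->n_rules;	/* search window is [lo, hi) */

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		const struct rule *r = c->map[mid];

		if (r->rulenum < key || (r->rulenum == key && r->id < id))
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}

/* O(1) if the cached slot is still valid, O(log N) otherwise. */
static int
resolve_ref(const struct chain *c, const struct rule_ref *ref)
{
	assert(c != NULL && ref != NULL);
	if (ref->chain_id == c->id && ref->slot >= 0 && ref->slot < c->n_rules)
		return (ref->slot);
	return (find_rule(c, ref->rulenum, ref->rule_id));
}
```

Because a reference never holds a rule pointer, a rule deleted while a packet sits in dummynet or netgraph cannot be dereferenced after it has been freed.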
|
200838 |
22-Dec-2009 |
luigi |
some mostly cosmetic changes in preparation for upcoming work:
+ in many places, replace &V_layer3_chain with a local variable chain;
+ bring the counter of rules and static_len within ip_fw_chain replacing static variables;
+ remove some spurious comments and extern declaration;
+ document which lock protects certain data structures
|
200673 |
18-Dec-2009 |
ru |
Added proper attribution.
Requested by: luigi
|
200654 |
17-Dec-2009 |
luigi |
Add some experimental code to log traffic with tcpdump, similar to pflog(4). To use the feature, just put the 'log' options on rules you are interested in, e.g.
ipfw add 5000 count log ....
and run tcpdump -ni ipfw0 ...
net.inet.ip.fw.verbose=0 enables logging to ipfw0, net.inet.ip.fw.verbose=1 sends logging to syslog as before.
More features can be added, similar to pflog(), to store in the MAC header metadata such as rule numbers and actions. Manpage to come once features are settled.
|
200634 |
17-Dec-2009 |
luigi |
simplify and document lookup_next_rule()
|
200629 |
17-Dec-2009 |
luigi |
simplify the code that finds the next rule after reinjections
MFC after: 1 week
|
200610 |
16-Dec-2009 |
luigi |
remove a duplicate sysctl entry
|
200603 |
16-Dec-2009 |
luigi |
bring back a couple of #include that are supplied by nesting, and explain why they are used.
|
200601 |
16-Dec-2009 |
luigi |
Various cosmetic cleanup of the files:
- move global variables around to reduce the scope and make them static if possible;
- add an ipfw_ prefix to all public functions to prevent conflicts (the same should be done for variables);
- try to pack variable declaration in a uniform way across files;
- clarify some comments;
- remove some misspelling of names (#define V_foo VNET(bar)) that slipped in due to cut&paste
- remove duplicate static variables in different files;
MFC after: 1 month
|
200598 |
16-Dec-2009 |
imp |
Quick fix to make this compile: Remove redundant extern declarations. If the maintainer has a better fix, then feel free to back this out.
|
200590 |
15-Dec-2009 |
luigi |
more splitting of ip_fw2.c, now extract the 'table' routines and the sockopt routines (the upper half of the kernel).
Whoever is the author of the 'table' code (Ruslan/glebius/oleg ?) please change the attribution in ip_fw_table.c. I have copied the copyright line from ip_fw2.c but it carries my name and I have neither written nor designed the feature so I don't deserve the credit.
MFC after: 1 month
|
200580 |
15-Dec-2009 |
luigi |
Start splitting ip_fw2.c and ip_fw.h into smaller components. At this time we pull out from ip_fw2.c the logging functions, and support for dynamic rules, and move kernel-only stuff into netinet/ipfw/ip_fw_private.h
No ABI change involved in this commit, unless I made some mistake. ip_fw.h has changed, though not in the userland-visible part.
Files touched by this commit:
conf/files now references the two new source files
netinet/ip_fw.h remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h.
netinet/ipfw/ip_fw_private.h new file with kernel-specific ipfw definitions
netinet/ipfw/ip_fw_log.c ipfw_log and related functions
netinet/ipfw/ip_fw_dynamic.c code related to dynamic rules
netinet/ipfw/ip_fw2.c removed the pieces that go in the new files
netinet/ipfw/ip_fw_nat.c minor rearrangement to remove LOOKUP_NAT from the main headers. This requires a new function pointer.
A bunch of other kernel files that included netinet/ip_fw.h now require netinet/ipfw/ip_fw_private.h as well. Not 100% sure i caught all of them.
MFC after: 1 month
|
200567 |
15-Dec-2009 |
luigi |
implement a new match option,
lookup {dst-ip|src-ip|dst-port|src-port|uid|jail} N
which searches the specified field in table N and sets tablearg accordingly. With dst-ip or src-ip the option replicates two existing options. When used with other arguments, the option can be useful to quickly dispatch traffic based on other fields.
Work supported by the Onelab project.
MFC after: 1 week
|
200361 |
10-Dec-2009 |
luigi |
use div64 when converting back the burst value for userland
|
200360 |
10-Dec-2009 |
luigi |
when draining a flowset free the entire chain, not just one packet.
|
200358 |
10-Dec-2009 |
luigi |
centralize the code to free a packet (or a chain) while in dummynet. Remove an old macro and its stale comment.
|
200170 |
05-Dec-2009 |
oleg |
Fix burst processing for WF2Q pipes - do not increase available burst size unless pipe is idle. This should fix the following issues:
- 'dummynet: OUCH! pipe should have been idle!' log messages.
- exceeding configured pipe bandwidth.
MFC after: 1 week
|
200118 |
05-Dec-2009 |
luigi |
adjust comment in previous commit after Julian's explanation
|
200116 |
05-Dec-2009 |
luigi |
remove a dead block of code, document how the ipfw clients are hooked and the difference in handling the 'enable' variable for layer2 and layer3. The latter needs fixing once i figure out how it worked pre-vnet.
MFC after: 7 days
|
200113 |
05-Dec-2009 |
luigi |
fix build with VNET enabled
Reported by: David Wolfskill
|
200102 |
04-Dec-2009 |
ume |
Use INET_ADDRSTRLEN and INET6_ADDRSTRLEN rather than hard coded number.
Spotted by: bz
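A short sketch of sizing a text buffer with the standard constants instead of a hard-coded number. `addr_to_str` is a hypothetical wrapper; `inet_ntop()`, `INET_ADDRSTRLEN` and `INET6_ADDRSTRLEN` are the standard interfaces.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Format an address of family af into buf.  INET6_ADDRSTRLEN is large
 * enough for either family, so one buffer size serves both.
 * Returns buf on success, NULL on failure (as inet_ntop() does).
 */
static const char *
addr_to_str(int af, const void *addr, char buf[INET6_ADDRSTRLEN])
{
	assert(af == AF_INET || af == AF_INET6);
	return (inet_ntop(af, addr, buf, INET6_ADDRSTRLEN));
}
```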
|
200059 |
03-Dec-2009 |
luigi |
preparation work to replace the monster switch in ipfw_chk() with table of functions.
This commit (which is heavily based on work done by Marta Carbone in this year's GSOC project), removes the goto's and explicit return from the inner switch(), so we will have an easier time when putting the blocks into individual functions.
MFC after: 3 weeks
|
200055 |
03-Dec-2009 |
ume |
Teach the debug prints about IPv6.
|
200040 |
02-Dec-2009 |
luigi |
- initialize src_ip in the main loop to prevent a compiler warning (gcc 4.x under linux, not sure how real is the complaint).
- rename a macro argument to prevent name clashes.
- add the macro name on a couple of #endif
- add a blank line for readability.
MFC after: 3 days
|
200029 |
02-Dec-2009 |
luigi |
small changes for portability and diff reduction wrt/ FreeBSD 7. No functional differences.
- use the div64() macro to wrap 64 bit divisions (which almost always are 64 / 32 bits) so they are easier to handle with compilers or OS that do not have native support for 64bit divisions;
- use a local variable for p_numbytes even if not strictly necessary on HEAD, as it reduces diffs with FreeBSD7
- in dummynet_send() check that a tag is present before dereferencing the pointer.
- add a couple of blank lines for readability near the end of a function
MFC after: 3 days
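A sketch of the wrapping idea: every 64-bit division goes through one macro, so a platform without native 64-bit division only needs to replace that one definition. The macro body shown here is an assumption about its simplest form, and `ticks_to_send` is a hypothetical caller, not a function from ip_dummynet.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed simplest form; a port could substitute a helper function here. */
#define div64(a, b)	((int64_t)(a) / (int64_t)(b))

/* Ticks needed to send len_bits at 'bandwidth' bit/s with hz ticks per second. */
static int64_t
ticks_to_send(int64_t len_bits, int32_t hz, int32_t bandwidth)
{
	assert(bandwidth > 0);
	/* round up so that a partial tick still counts as a full one */
	return (div64(len_bits * hz + bandwidth - 1, bandwidth));
}
```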
|
200027 |
02-Dec-2009 |
ume |
Teach send_pkt() and ipfw_tick() about IPv6. This fixes the issue where keep-alives did not work for IPv6.
PR: kern/117234 Submitted by: mlaier, Joost Bekkers <joost__at__jodocus.org> MFC after: 1 month
|
199073 |
09-Nov-2009 |
oleg |
style(9): add missing parentheses
|
198845 |
03-Nov-2009 |
oleg |
Fix two issues that can lead to exceeding configured pipe bandwidth:
- do not expire queues which are not ready to be expired.
- properly calculate available burst size.
MFC after: 3 days
|
197952 |
11-Oct-2009 |
julian |
Virtualize the pfil hooks so that different jails may choose different packet filters. Also allows ipfw to be enabled on one jail and disabled on another. In 8.0 it's a global setting.
Sitting around in tree waiting to commit for: 2 months MFC after: 2 months
|
196453 |
23-Aug-2009 |
julian |
Fix another typo right next to the previous one, that amazingly, I did not see before.
MFC after: 1 week
|
196451 |
23-Aug-2009 |
julian |
Fix typo in comment that has been bugging me for days.
MFC after: 1 week
|
196423 |
21-Aug-2009 |
julian |
Fix ipfw's initialization functions to get the correct order of evaluation to allow vnet and non vnet operation. Move some functions from ip_fw_pfil.c to ip_fw2.c and move to mostly using the SYSINIT and VNET_SYSINIT handlers instead of the modevent handler. Correct some spelling errors in comments in the affected code. Note this fixes a crash in NON VIMAGE kernels when ipfw is unloaded.
This patch is a minimal patch for 8.0. I have a much larger patch that actually fixes the underlying problems that will be applied after 8.0.
Reviewed by: zec@, rwatson@, bz@(earlier version) Approved by: re (rwatson) MFC after: Immediately
|
196322 |
17-Aug-2009 |
jhb |
Purge mergeinfo in sys/ that is either empty or a subset of the parent mergeinfo on sys/ itself.
Approved by: re (mergeinfo blanket)
|
196201 |
14-Aug-2009 |
julian |
Fix ipfw crash on uid or gid check. Receiving any ip packet for which there is no existing socket will crash if ipfw has a uid or gid test rule, as the uid/gid of the non existent owner of said non existent socket is tested. Brooks introduced this error as part of his >16 gids patch. It appears to be a cut-n-paste error from similar code a few lines before. The old code used the 'pcb' variable here, but the new code switched to the 'inp' variable, which is often NULL and is what is tested in the code further up. The rest of the multi-gid patch for ipfw seems solid (and cleaner than previous code).
Reviewed by: brooks Approved by: re (rwatson)
|
196019 |
01-Aug-2009 |
rwatson |
Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes.
Reviewed by: bz Approved by: re (vimage blanket)
|
195923 |
28-Jul-2009 |
julian |
Start up the vnet part of initialization a bit after the global part. Fixes crash on boot if ipfw is compiled in.
Submitted by: tegge@ Reviewed by: tegge@ Approved by: re (kib)
|
195862 |
25-Jul-2009 |
julian |
Catch ipfw up to the rest of the vimage code. It got left behind when it moved to its new location.
Approved by: re (kensmith)
|
195727 |
16-Jul-2009 |
rwatson |
Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references.
Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
|
195699 |
14-Jul-2009 |
rwatson |
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
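A much simplified userland sketch of the "reference copy plus per-instance copy" idea described above. It uses a plain struct in place of the linker set, and the names `vnet_set`, `vnet_create` and `VNET_SKETCH` are hypothetical; the real accessor macro and allocator are more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the linker set: all virtualized globals in one block. */
struct vnet_set {
	int fw_enable;
	int fw_verbose;
};

/* Reference copy carrying the static initializers. */
static const struct vnet_set vnet_reference = { 1, 0 };

struct vnet {
	struct vnet_set *data;	/* this instance's private copy */
};

/* Accessor: resolve a symbol relative to the given instance's memory. */
#define VNET_SKETCH(vnet, sym)	((vnet)->data->sym)

/* Create an instance initialized from the reference copy. */
static int
vnet_create(struct vnet *v)
{
	assert(v != NULL);
	v->data = malloc(sizeof(*v->data));
	if (v->data == NULL)
		return (-1);
	memcpy(v->data, &vnet_reference, sizeof(vnet_reference));
	return (0);
}

static void
vnet_destroy(struct vnet *v)
{
	free(v->data);
	v->data = NULL;
}
```

Each instance starts from the same initial values, and a write through the accessor in one instance is invisible to the others.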
|
195023 |
26-Jun-2009 |
rwatson |
Update various IPFW-related modules to use if_addr_rlock()/ if_addr_runlock() rather than IF_ADDR_LOCK()/IF_ADDR_UNLOCK().
MFC after: 6 weeks
|
194930 |
24-Jun-2009 |
oleg |
- fix dummynet 'fast' mode for WF2Q case.
- fix printing of pipe profile data.
- introduce new pipe parameter: 'burst' - how much data can be sent through pipe bypassing bandwidth limit.
|
194498 |
19-Jun-2009 |
brooks |
Rework the credential code to support larger values of NGROUPS and NGROUPS_MAX, eliminate ABI dependencies on them, and raise them to 1024 and 1023 respectively. (Previously they were equal, but under a close reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member cr_groups from a static array to a pointer. Do the equivalent in kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating a process credential before modifying it and for setting group lists respectively. Both interfaces take care of the details of allocating the groups array. crsetgroups() takes care of truncating the group list to the current maximum (NGROUPS) if necessary. In the future, crsetgroups() may be responsible for ensuring invariants such as sorting the supplemental groups to allow groupmember() to be implemented as a binary search.
Because we can not change struct xucred without breaking application ABIs, we leave it alone and introduce a new XU_NGROUPS value which is always 16 and is to be used or NGRPS as appropriate for things such as NFS which need to use no more than 16 groups. When feasible, truncate the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation) X-MFC after: never PR: bin/113398 kern/133867
|
194245 |
15-Jun-2009 |
oleg |
Since dn_pipe.numbytes is int64_t now - remove unnecessary overflow detection code in ready_event_wfq().
|
193896 |
10-Jun-2009 |
luigi |
in ip_dn_ctl(), do not allocate a large structure on the stack, and use malloc() instead if/when it is necessary.
The problem is less relevant in previous versions because the variable involved (tmp_pipe) is much smaller there. Still worth fixing though.
Submitted by: Marta Carbone (GSOC) MFC after: 3 days
|
193894 |
10-Jun-2009 |
luigi |
small simplifications to the code in charge of reaping deleted rules:
- clear the head pointer immediately before using it, so there is no chance of mistakes;
- call reap_rules() unconditionally. The function can handle a NULL argument just fine, and the cost of the extra call is hardly significant given that we do it rarely and outside the lock.
MFC after: 3 days
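The two simplifications can be sketched in userland as follows. Locking is only indicated in comments, `flush_reap_list` is a hypothetical wrapper, and the `reap_rules` shown here is a stand-in with the same NULL-tolerant behaviour rather than the kernel function.

```c
#include <assert.h>
#include <stdlib.h>

struct rule {
	struct rule *next;
};

/* Free a chain of rules; a NULL head is a valid empty chain. */
static int
reap_rules(struct rule *head)
{
	int n = 0;

	while (head != NULL) {
		struct rule *next = head->next;

		free(head);
		head = next;
		n++;
	}
	return (n);		/* number of rules freed */
}

/*
 * Detach the pending list: save the head and clear the shared pointer
 * right away (this part would run with the lock held), then pass the
 * private copy to reap_rules() (which would run after unlocking).
 */
static int
flush_reap_list(struct rule **reap_head)
{
	struct rule *victims;

	assert(reap_head != NULL);
	victims = *reap_head;
	*reap_head = NULL;
	return (reap_rules(victims));	/* unconditional: NULL is fine */
}
```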
|
193859 |
09-Jun-2009 |
oleg |
Close a long-standing race with net.inet.ip.fw.one_pass = 0: If a packet leaves ipfw for another kernel subsystem (dummynet, netgraph, etc.) it carries a pointer to the matching ipfw rule. If this packet is then reinjected back into ipfw, ruleset processing starts from that rule. If the rule was deleted in the meantime, the race condition could cause a panic (as well as other odd effects like parsing rules in the 'reap list').
P.S. this commit changes ABI so userland ipfw related binaries should be recompiled.
MFC after: 1 month Tested by: Mikolaj Golub
|
193744 |
08-Jun-2009 |
bz |
After r193232 rt_tables in vnet.h are no longer indirectly dependent on the ROUTETABLES kernel option thus there is no need to include opt_route.h anymore in all consumers of vnet.h and no longer depend on it for module builds.
Remove the hidden include in flowtable.h as well and leave the two explicit #includes in ip_input.c and ip_output.c.
|
193532 |
05-Jun-2009 |
luigi |
move kernel ipfw-related sources to a separate directory, adjust conf/files and modules' Makefiles accordingly.
No code or ABI changes so this and most of previous related changes can be easily MFC'ed
MFC after: 5 days
|