commit_messages.txt 22.9 KiB
Jonathan Schöbel committed

Classes:
Data:
	This class will handle Data needed by the CMS, provides storage
	for modules, manages the database connection and maybe also
	contains some caches. At the moment it only provides access to
	the Validator.
	The two predicates SH_Data_check_tag and SH_Data_check_attr are
	wrappers to the appropriate methods of the validator. These are
	needed, as there shouldn't be direct calls to the internal
	structure of SH_Data.
	The modifying methods are not exposed, as the validator
	shouldn't be changed while others depend on it. Exposing them
	safely has to be implemented later.
	Data also contains a wrapper for the self-closing tag predicate.
Attr:
	The structure SH_Attr implements an HTML Attribute.
	For every function there is also a static method/function,
	which can perform the same work, but doesn't rely on really
	having a single struct Attr. This is useful for example in an
	array to manipulate a single element.


@subsubsection NodeFragment - Attributes

	The method SH_NodeFragment_get_attr provides a pointer to an Attr by
	its index. Note that it points directly to the internal data instead
	of copying the data to a new Attr, which would be unnecessary overhead
	if only read access is needed. That's why it is a pointer to const.
	If the user intends to modify it, a copy should be taken via
	SH_Attr_copy.

	Multiple insert methods allow either adding an existing Attr or
	creating a new one implicitly. If the Attr does not already exist
	beforehand, it is more efficient to call the attr_new methods. Also an
	existing Attr is freed after it was inserted, thus it can't be used
	afterwards. This is necessary, as for efficiency reasons an array of
	Attr is used directly, instead of the indirect approach of storing
	pointers to Attr. This means that the contents of the Attr have to be
	copied to the internal structure. If the old Attr were left unfreed,
	there would be two Attrs, the original one and the implicit one,
	referring to the same data, which would lead to data corruption at
	least, or to undefined behaviour like a double free, which would be a
	serious threat for a library which is to be used on a webserver. ...
	For each of the two insert modes, there is a method to prepend, append
	or insert at a specific position. An incorrect position is handled
	inside of the external method and an E_VALUE is thrown. The internal
	method doesn't handle this, so special care must be taken not to cause
	undefined behaviour. However, enforcing this check would be unnecessary
	overhead for the prepend and append methods, which are known to have
	correct indices, as well as for other internal methods, where the
	internal method may be used.
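
	The split between a checked external method and an unchecked internal
	one can be sketched as follows. This is only a minimal illustration on
	a plain int array, assuming the caller has already reserved enough
	space; the names insert_unchecked, insert_at, prepend and append are
	hypothetical and not part of the library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Internal: assumes position <= *n and room for one more element.
 * Violating this is undefined behaviour, so callers must check. */
void
insert_unchecked (int * array, size_t * n, size_t position, int value)
{
	/* shift the tail one slot to the right to open a gap */
	memmove (&array[position + 1], &array[position],
	         (*n - position) * sizeof (int));
	array[position] = value;
	(*n)++;
}

/* External: validates the position, then delegates. */
bool
insert_at (int * array, size_t * n, size_t position, int value)
{
	if (position > *n) return false; /* stands in for E_VALUE */
	insert_unchecked (array, n, position, value);
	return true;
}

/* The position is correct by construction, so no check is needed. */
void
prepend (int * array, size_t * n, int value)
{
	insert_unchecked (array, n, 0, value);
}

void
append (int * array, size_t * n, int value)
{
	insert_unchecked (array, n, *n, value);
}
```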

	Two alternatives are provided: remove_attr and pop_attr. While the
	former frees the Attr's data, the latter allocates a new Attr to store
	and return the data. Both are implemented by a single (internal)
	static method.

	@subsubsection Childs

	A Fragment can contain children. When building the html, the
	children's html is generated where appropriate.
	The methods
	- SH_Fragment_get_child (by index)
	- SH_Fragment_is_child (non recursive) and
	- SH_Fragment_is_descendant (recursive)
	were added.
	A Fragment can be copied either recursively (also copying all
	children) or non-recursively (ignoring the children, thus the
	copy never has any children).
	Adding the same element twice to the tree (graph) isn't
	possible, as this would lead to problems, e.g. a double free or
	similar quirks.

	The single method (formerly SH_NodeFragment_append_child) to add a child
	at the end of the child list was replaced, by a bunch of methods to
	insert a child at the beginning (SH_NodeFragment_prepend_child), at the
	end (SH_NodeFragment_append_child), at a specific position
	(SH_NodeFragment_insert_child) and directly before
	(SH_NodeFragment_insert_child_before) or after another child
	(SH_NodeFragment_insert_child_after). All these methods are implemented
	by a single internal one (insert_child), as there isn't really much
	difference in inserting one or the other way.
	However, this internal method doesn't check whether the insertion
	request is actually doable, to save overhead, as not every insertion
	method requires this check. The check is done by the respective
	method. If it is not done correctly, the internal method will attempt
	to write to unallocated memory, which will hopefully result in a
	segfault.

	The child list is implemented as an array. To reduce the overhead of
	realloc calls, the array is allocated in chunks of children. The
	calculation of how many slots have to be allocated is done by another
	static method and determined by the macro CHILD_CHUNK. This is set to
	5, which is just a guess. It should be somewhere around the average
	number of children per html element, to keep the unused space small.

	Also some predicates (SH_NodeFragment_is_parent,
	SH_NodeFragment_is_ancestor) were added to check whether a relationship
	exists between two nodes, i.e. whether they are linked through one or
	multiple levels. These functions could replace the old ones
	(SH_NodeFragment_is_child, SH_NodeFragment_is_descendant) semantically.
	Furthermore they are more efficient, as the check can now follow the
	parent pointer. The internal insert method also uses these
	methods to check whether the child node is actually a parent of the
	parent node, which would result in errors later on.
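
	Why the parent pointer makes these predicates cheap can be seen in the
	following sketch. It assumes a reduced, hypothetical struct node that
	only carries the parent link; the real NodeFragment has more members
	and the function names here are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced node: only the parent link matters here. */
struct node
{
	struct node * parent;
};

/* true if parent is the direct parent of node */
bool
node_is_parent (const struct node * node, const struct node * parent)
{
	return node->parent == parent;
}

/* true if ancestor can be reached from node by following parent links */
bool
node_is_ancestor (const struct node * node, const struct node * ancestor)
{
	const struct node * cur;

	for (cur = node->parent; cur != NULL; cur = cur->parent)
	{
		if (cur == ancestor) return true;
	}
	return false;
}
```

	The walk visits at most one node per tree level, whereas a check
	starting from the ancestor would have to search all of its subtrees.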

	The old test is now obsolete but remained, as it is not bad to test
	more.

	Various remove methods were added, which are all implemented by a
	static method, analogous to the insert methods.


	@subsubsection Misc
	A Fragment can output its html. If there is an error, the method
	aborts and returns NULL.
	This method also pays attention to self-closing tags, which is
	determined via the validator.
	When the wrap mode is used, after each tag a newline is started.
	Also the html is indented, which can be configured by the
	parameters indent_base, indent_step and indent_char. The
	parameter indent_base specifies the width the first tag should
	be indented with, while indent_step specifies the increment of
	the indent when switching to a child tag. The character, that is
	used for indenting is taken from indent_char. (It could also be
	a string longer than a single character).
	These arguments can't be set by the user, but are hardcoded
	(for now).
	The to_html method also generates the html for the attributes.
	Note that the quotes the values are wrapped in are not escaped
	inside the values. But this is somewhat consistent, as there is
	no syntax validation on the tags either
	(e.g. nothing prevents a '<' inside of a tag).
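
	How indent_base and indent_step combine can be sketched like this.
	The function make_indent and its depth parameter are hypothetical,
	and the sketch assumes indent_char is a single character rather
	than a string.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated string consisting of indent_char repeated
 * (indent_base + depth * indent_step) times, or NULL if malloc fails.
 * depth is 0 for the first tag and grows by one per child level. */
char *
make_indent (size_t indent_base, size_t indent_step,
             char indent_char, size_t depth)
{
	size_t width = indent_base + depth * indent_step;
	char * indent = malloc (width + 1);

	if (indent == NULL) return NULL;
	memset (indent, indent_char, width);
	indent[width] = '\0';
	return indent;
}
```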

	The TextFragment is used to implement the text between and
	outside html tags. Currently, it is still very rudimentary in
	that it doesn't support any operations at all and just has a
	function to expose the internal text.
	While this function is necessary to manipulate the content of a
	TextFragment, the TextFragment should abstract the semantics of
	Text. While simple wrapper functions for appending are to be
	added, methods purely manipulating the text, i.e. relying on
	the text's contents, won't get wrapper functions. Thus this
	function is still needed until a more sophisticated approach is
	implemented.
	Some basic text functionality is already supported via wrapper
	functions.
	Note that wrapper functions aren't tested in unit tests.

	When a newline is encountered in the text, a <br /> is inserted
	and for wrap mode also a newline and an indent is inserted.
	Note, that the indent is still missing at the front where it
	can't be inserted yet as SH_Text is still lacking basic
	functionality.


	The html generation for both TextFragment and NodeFragment
	combined is tested. As the encoding semantics of the
	TextFragments are neither defined nor implemented, some tests
	are marked as XFAIL.


	What is still missing is the proper treatment of embedded text.
	It should be indented and wrapped at 72/79/80 columns. Also
	newlines and special chars should be replaced on generation,
	maybe also giving some way of preventing XSS. Regarding the
	NodeFragment, there should be some options to further adjust
	the styling, which of course should also be reflected by
	TextFragment. This should also include the generation of
	self-closing tags.
	Furthermore the html generation should be based on a single
	text object, to which the output is appended. This will later
	also make it possible to send generated parts over the network
	directly, while other parts are still being generated.

Validator:
	Validator serves as a syntax checker, i.e. it can be asked
	whether a tag is allowed.
	On initialization (of data), the Validator's knowledge is filled
	with some common tags. This is of course to be replaced later,
	by some dynamic handling.
	When a tag is made known to the Validator, which it already
	knows, the old id is returned and nothing is added.

	The Validator saves the tags as an array. Now it additionally records
	which slots are currently unused, to spare expensive calls to
	realloc. This essentially led to a reimplementation of the functions.
	Tags can't be deleted yet, but the adding function supports reusing
	empty slots. Also the reading functions have to determine whether a
	slot can be read or is empty.
	The tests were adjusted, but are buggy, so they should be rewritten in
	the future.

	A registered tag can be deregistered by calling SH_Validator_deregister.
	The data is removed, but the space is not deallocated if it is not at
	the end. This prevents copying data on removal and saves expensive calls
	to realloc. Instead the empty space is added to the list of free blocks,
	which allows these spaces to be refilled when a new tag is registered.
	The space is finally deallocated when the validator is deallocated
	or the tag stored in the last block is removed. In the latter case,
	heavy iteration is performed, as the list of free blocks is not ordered.
	The tag that is then the last one is determined by searching the list
	of free blocks for the preceding blocks one by one, until a block is
	not found in it.
	Note that even if there are a lot of gaps in between, the Validator
	will not allocate more space when a new tag is registered until all
	these gaps are refilled. Thus new space is only allocated if there
	is really not enough space left.
	Due to the 4 nested loops, there was an issue related to the
	72(80)-column rule. It can't be obeyed without severely impacting the
	readability of the code.

	Originally the ids were intended to be useful for linking different
	information together internally, and for providing references
	externally. However, they weren't used internally. For this, pointers
	seemed to be more useful, as they also allow direct access to the data
	and also have a relation defined.
	Regarding reference purposes, they aren't really needed, and it is more
	convenient to directly use some strings. They also aren't more
	performant, as there still have to be internal checks, and looking for
	an int isn't more performant than looking for a pointer.
	Also, they have to be stored, so they need more memory and also some
	code to be handled.

	While it was very clever, the complex data structure of the tag array
	introduced in 'Validator: restructured internal data (a0c9bb2)' comes
	with a lot of runtime overhead. It reduces the calls to free and
	realloc, when a lot of tags are deleted and inserted subsequently, but
	burdens each call with a loop over the linked list of free blocks.

	This is even more important, as validator must be fast in checking, as
	this is done every time something is inserted into the DOM-tree, but has
	not so tight requirements for registering new tags, as this is merely
	done at startup time.

	As the access must be fast, the tags are sorted when inserted, so that
	the search can take place in log-time.
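
	A lookup over such a sorted tag array can be sketched as a binary
	search. The function tag_is_known and the plain array of strings are
	assumptions for illustration; the Validator's real data structure
	carries more information per tag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if tag occurs in tags[0..tag_n-1].
 * tags must be sorted in ascending strcmp order. */
bool
tag_is_known (const char * const * tags, size_t tag_n, const char * tag)
{
	size_t low = 0;
	size_t high = tag_n; /* the search interval is [low, high) */

	while (low < high)
	{
		size_t mid = low + (high - low) / 2;
		int cmp = strcmp (tag, tags[mid]);

		if (cmp == 0) return true;
		if (cmp < 0) high = mid;
		else low = mid + 1;
	}
	return false;
}
```

	Each iteration halves the interval, so a few hundred tags need fewer
	than ten string comparisons per check.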

	There is a method to add a set of tags to a validator on initialisation.
	First, this relieves a user application of the burden of maintaining
	the html spec. Second, it is more performant, as a lot of tags are
	inserted at once, so there aren't multiple allocation calls.
	As the validator needs the tags to be in order, the tags must be sorted
	on insertion. Of course it would be easier for the code if the tags
	were already in order, but first, a mistake could easily be made, and
	second, sorting the tags by an algorithm allows the tags to be specified
	in a logically grouped and thus more maintainable order.
	For the sorting, insertion sort is used. Of course it has a quadratic
	worst-case time complexity, but in a constructor, I wouldn't introduce
	the overhead of memory management a heap- or mergesort would introduce,
	and in-place sorting is also out, because the data lies in ro-memory.
	Thus I chose an algorithm with constant space complexity. Also the
	'long' running time is not so important, as the initialization only
	runs once at startup and the tags are not likely to exceed a few
	hundred, so even a quadratic time isn't that bad.
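
	The idea of sorting while copying out of read-only memory can be
	sketched as below. The function insert_sorted is hypothetical and
	works on bare string pointers; it only reads the source and needs no
	memory besides the destination array, which has to exist anyway.

```c
#include <stddef.h>
#include <string.h>

/* Copies the n string pointers from src into dest in ascending strcmp
 * order. src is never written to; dest must have room for n pointers. */
void
insert_sorted (const char ** dest, const char * const * src, size_t n)
{
	size_t i, j;

	for (i = 0; i < n; i++)
	{
		/* dest[0..i-1] is sorted; shift larger entries to the right */
		for (j = i; j > 0 && strcmp (dest[j - 1], src[i]) > 0; j--)
		{
			dest[j] = dest[j - 1];
		}
		dest[j] = src[i];
	}
}
```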

	Each tag has a type as defined by the html spec. This must be provided
	on registration. Implicitly registering tags, when an attribute is
	registered can't be done anymore, as the type information would be
	missing.
	The added parameter in register_tag, as well as the change of behaviour
	in register_attr, has broken a lot of tests, which therefore had to be
	adjusted.

	Added self-closing predicate. Other predicates may follow.

	The Validator contains already all HTML5 tags.
	Tags according to:
	https://html.spec.whatwg.org/dev/indices.html#elements-3

	Types according to:
	https://html.spec.whatwg.org/multipage/syntax.html#elements-2

	Retrieved 04. 10. 2023


	An attribute can be deregistered by calling
	SH_Validator_deregister_attr.
	Note that deregistering an attr that was never registered is considered
	an error. This may change, as technically the attr is not registered
	afterwards, and sometimes (e.g. for a blacklist) it might be preferable
	to ensure that a specific attr is not registered. It is not yet clear
	whether there should be an error or not.
	Also the deallocating of the data used for an attr was moved to an extra
	method, as this is needed in several locations and it might be subject
	to change.

	The Validator can check whether an attribute is allowed in a tag. It
	does so by associating allowed tags with attributes. It is done this
	way to also support attributes which are allowed for every tag (global
	attributes), but this is not yet supported. So some functions allow
	NULL to be passed and some will still crash.

	The predicate SH_Validator_check_attr returns whether an attribute is
	allowed for a specific tag. If tag is NULL, it returns whether an attr
	is allowed at all, not whether it is allowed for every tag. For this
	another predicate will be provided, when this is to be implemented.

	The method SH_Validator_register_attr registers a tag-attr combination.
	Note that it will automatically call SH_Validator_register_tag if the
	tag doesn't exist. Later it will be possible to set tag to NULL to
	register a global attribute, but for now the method will crash.

	The method SH_Validator_deregister_attr removes a tag-attr combination
	registered earlier. Note that deregistering a non-existent combination
	will result in an error. This behaviour is debatable and might be
	subject to change. When only tag is set to NULL, all tags for this
	attribute are deregistered. When only attr is set to NULL, all attrs
	for this tag are deregistered. This might cause problems if it involves
	attrs that are global. Also this will use the internal method
	remove_tag_for_all_attrs, which has the problem that it might fail
	partially. Normally, when failing, all functions revert the program to
	the same state it was in before the call. This function however is
	different: if it fails, some combinations might not have been
	removed while others already are. Nevertheless, the validator is
	still in a valid state, so it is possible to call this function a second
	time, but it is not known which combinations are already deregistered.

	As the attrs also use the internal strings of the tags, it must be
	ensured, when a tag is deregistered, that all remaining references are
	removed, otherwise there would be dangling pointers. Note, that for this
	also remove_tag_for_all_attrs is used, so the method
	SH_Validator_deregister_tag suffers from the same problems listed above.
	Also if this internal method fails, the tag won't be removed at all.

	Similar to the tags, the attributes can be initialized. Missing tags are
	automatically added. The declaration syntax is currently a bit annoying,
	as the tags that belong to an attribute either have to be declared
	explicitly, or a pointer to the tag declaration must be given, but then
	only consecutive tags are possible.
	Support for global attributes is likewise missing; it must be ensured
	that (tag_n != 0) && (tags != NULL). Otherwise the validator will be
	inconsistent and may misbehave.

	Global attributes are represented by empty attributes. A global
	attribute is an attribute that is accepted for any tag.
	Removing a specific tag from a global attribute is refused, as this
	would mean "localizing" the attribute, thus making it not global
	anymore. The method to do that and a predicate for globalness are still
	missing.

	Deregistering a global attribute normally is not possible, as basically
	every other tag has to be added. This was implemented now.
	Originally it was intended to provide the caller with the information
	that a global attribute has to be converted into a local one before
	removal. However such internals should not be exposed to the caller. As
	it stands, there is no real reason to inform a caller whether an
	attribute is local or global. Also, there is the problem that the
	predicate is burdened with the possibility that the attribute doesn't
	exist, thus it can't return a boolean directly. For both reasons, the
	predicate isn't added yet.
	Also a bug was detected in the method remove_tag_for_all_attrs. It
	removes an attribute while also iterating over the attributes, thus
	potentially skipping over some attribute and maybe also invoking
	undefined behaviour by deallocating space after the array.
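
	The class of bug described above, and one common way to avoid it, can
	be shown on a plain int array. This is a generic sketch, not the
	actual fix applied to remove_tag_for_all_attrs; the function
	remove_all is hypothetical.

```c
#include <stddef.h>

/* Removes every occurrence of value from array[0..n-1], keeping the
 * order of the remaining elements. Returns the new element count. */
size_t
remove_all (int * array, size_t n, int value)
{
	size_t i = 0;

	while (i < n)
	{
		if (array[i] == value)
		{
			size_t j;

			/* close the gap by moving the tail one slot left */
			for (j = i + 1; j < n; j++) array[j - 1] = array[j];
			n--;
			/* do not advance i: slot i now holds an unvisited
			 * element, which an unconditional i++ would skip */
		}
		else
		{
			i++;
		}
	}
	return n;
}
```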


	Copying a Validator could be useful if multiple html versions are to be
	supported. Another use case is a blacklist XSS-Scanner.

Text:
	This is a data type to deal with frequently appending to a string.
	The space a Text has for saving the string is allocated in chunks.
	To request additional space SH_Text_enlarge is called. If the
	requested size fits inside the already allocated space or is even
	smaller than the current size, nothing is done. Otherwise a
	multiple of chunk size is allocated being equal or greater than
	the requested size. The chunk size can be changed by changing
	the macro CHUNK_SIZE in src/text.h. The default is 64.
	The adjustment is done automatically when a string is added.
	SH_Text_append_string can be used to append a string to the text,
	SH_Text_append_text can be used to append another text to the text.
	SH_Text_join is a wrapper for SH_Text_append_text, but also frees
	the second text, thus joining the texts to a single one.
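
	The enlarge logic described above can be sketched as follows. The
	struct text and the function text_enlarge are simplified stand-ins
	for SH_Text and SH_Text_enlarge, and the sketch ignores the overflow
	that would occur for sizes close to SIZE_MAX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 64

/* Hypothetical, reduced text: a buffer and its allocated size. */
struct text
{
	char * data;
	size_t size; /* allocated bytes, a multiple of CHUNK_SIZE */
};

/* Ensures that at least size bytes are allocated. Returns false if
 * realloc fails; the text is left unchanged in that case. */
bool
text_enlarge (struct text * text, size_t size)
{
	size_t new_size;
	char * new_data;

	if (size <= text->size) return true; /* already fits */

	/* round up to the next multiple of CHUNK_SIZE */
	new_size = ((size + CHUNK_SIZE - 1) / CHUNK_SIZE) * CHUNK_SIZE;
	new_data = realloc (text->data, new_size);
	if (new_data == NULL) return false;

	text->data = new_data;
	text->size = new_size;
	return true;
}
```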

	The constructor SH_Text_new_from_string accepts a string, with which
	the text is initialized. This can replace the two calls SH_Text_new and
	SH_Text_append_string needed so far.

	The (internal) implementation of SH_Text was changed from an array of
	char to a singly linked list of arrays of char. This allows an easier
	implementation of (further) text manipulation.

	The API hasn't changed much, but SH_Text_join can't yield an error
	anymore, so it now doesn't support passing an error and returns nothing.
	The method SH_Text_get_char returns a single character by a given index.
	If the index is out of range, NULL is returned and error->type is set to
	VALUE_ERROR.

	The function SH_Text_get_string returns a substring of text beginning at
	index and of length offset. If index is out of bounds, NULL is returned
	and an error is set. If offset is out of bounds, the existent part is
	returned. Also the length of the returned string can be set (optionally)
	to the out parameter length.

	To achieve the original behaviour of SH_Text_get_string,
	SH_Text_get_string (text, length, error) has to be changed to
	SH_Text_get_string (text, 0, SIZE_MAX, length, error). The only
	difference will be that the function won't fail when the text is longer
	than SIZE_MAX, because it is told to stop there. A text that is longer
	than SIZE_MAX can't be returned, but that wasn't possible
	at any time. Also I don't think handling a char[] longer than SIZE_MAX
	is possible with the standard C library. Thus in this case the text can
	only be returned in parts (for now only possible up to 2*SIZE_MAX-1 by
	calling SH_Text_get_string (text, SIZE_MAX, SIZE_MAX, length, error))
	or has to be manipulated using the appropriate SH_Text methods, which
	are not implemented yet.

	The function SH_Text_get_range returns a string beginning at start and
	ending at end. Note that end specifies the first char that is not
	returned any more. Thus the function implements something similar to
	the Python slice syntax (text[start:end]). In contrast to the behaviour
	there, calling SH_Text_get_range with start > end is undefined
	behaviour. If start == end, the empty string is returned.
	If start is out of bounds, NULL is returned and an error is set. If end
	is out of bounds, the existent part is returned. Also the length of the
	returned string can be set (optionally) to the out parameter length.
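
	The range semantics can be sketched on a plain C string. The function
	get_range below is a hypothetical stand-in for SH_Text_get_range
	without the error parameter; treating start equal to the text length
	as still in bounds is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of text[start:end], end exclusive.
 * Precondition: start <= end. Returns NULL if start is out of bounds
 * or malloc fails. An end beyond the text is clamped to its length.
 * If length is not NULL, the length of the result is stored there. */
char *
get_range (const char * text, size_t start, size_t end, size_t * length)
{
	size_t text_length = strlen (text);
	size_t n;
	char * range;

	if (start > text_length) return NULL;
	if (end > text_length) end = text_length;

	n = end - start;
	range = malloc (n + 1);
	if (range == NULL) return NULL;

	memcpy (range, text + start, n);
	range[n] = '\0';
	if (length != NULL) *length = n;
	return range;
}
```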
	The function SH_Text_get_length returns the length of the text. As the
	text also supports being longer than SIZE_MAX, this method can fail at
	runtime. If the text is longer than SIZE_MAX, the function returns
	SIZE_MAX and sets error to DOMAIN_ERROR. Note that due to the
	implementation, this is a non-trivial function, so don't use it too
	extensively.
	The method SH_Text_print just prints the whole string to stdout.
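
	Why computing the length is not trivial and can overflow is visible
	in the following sketch, which assumes the text is a singly linked
	list of segments. The struct segment and the function text_length
	are hypothetical; a bool flag stands in for the DOMAIN_ERROR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced segment: only length and link are kept. */
struct segment
{
	size_t length;
	struct segment * next;
};

/* Sums the lengths of all segments. If the sum does not fit into a
 * size_t, SIZE_MAX is returned and *overflow is set to true. */
size_t
text_length (const struct segment * seg, bool * overflow)
{
	size_t total = 0;

	*overflow = false;
	for (; seg != NULL; seg = seg->next)
	{
		/* check before adding, so the addition can never wrap */
		if (seg->length > SIZE_MAX - total)
		{
			*overflow = true;
			return SIZE_MAX;
		}
		total += seg->length;
	}
	return total;
}
```

	The whole list has to be walked on every call, which is the reason
	to avoid calling such a function in a loop.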

	The function SH_Text_set_char writes a single character to a
	position that already exists in the text, thus overwriting another
	character. If the index is out of range, a value error is set and FALSE
	is returned.

	An attempt was made to implement the text in terms of multiple
	text segments.

	While it would be preferable, it doesn't seem to be possible to
	abstract over the internals of text_segment. That's why only
	some basic functionality is moved; whether more is to
	follow is not known yet.

	A text_segment allocates memory in terms of chunks. This is now
	also done when it is created from a string, but this means that
	we can't rely on strdup any more, as it takes care of the
	allocation itself. Calling malloc ourselves shouldn't be much
	of an overhead, as at least glibc's strdup performs the exact
	same steps. Actually we should even save a strlen call now, so
	it should be more performant.

	The copy_and_replace function replaces a single character with
	a string while copying. This may be replaced by a more elaborate
	function, as manipulating a text normally means that the
	manipulation is deferred until needed, which this function
	contradicts.


	Also there is the concept of a text mark.
	A mark will be used to point to a specific location inside of a
	text. Currently it can't do anything and isn't even used.